This is an archive of the discontinued LLVM Phabricator instance.

[BPF] Replace BPFMIPeepholeTruncElim by custom logic in isZExtFree()
ClosedPublic

Authored by eddyz87 on Aug 14 2023, 7:07 AM.

Details

Summary

Replace BPFMIPeepholeTruncElim by adding an overload of TargetLowering::isZExtFree() that is aware that zero extension is free for ISD::LOAD.

Short description

The BPFMIPeepholeTruncElim pass handles two patterns:

Pattern #1:

%1 = LDB %0, ...              %1 = LDB %0, ...
%2 = AND_ri %1, 0xff      ->  %2 = MOV_rr %1    <-- (!)

Pattern #2:

bb.1:                         bb.1:
  %a = LDB %0, ...              %a = LDB %0, ...
  br %bb3                       br %bb3
bb.2:                         bb.2:
  %b = LDB %0, ...        ->    %b = LDB %0, ...
  br %bb3                       br %bb3
bb.3:                         bb.3:
  %1 = PHI %a, %b               %1 = PHI %a, %b
  %2 = AND_ri %1, 0xff          %2 = MOV_rr %1  <-- (!)

Plus variations:

  • AND_ri_32 instead of AND_ri
  • SLL/SRL instead of AND_ri (illustrated below)
  • LDH, LDW, LDB32, LDH32, LDW32
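
For illustration, the shift-based variation of pattern #1 looks roughly as follows (a sketch; the shift amount 56 corresponds to LDB, i.e. an 8-bit value zero-extended inside a 64-bit register):

%1 = LDB %0, ...              %1 = LDB %0, ...
%2 = SLL_ri %1, 56        ->  %3 = MOV_rr %1
%3 = SRL_ri %2, 56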

Both patterns can be handled by built-in transformations at the instruction selection phase if a suitable isZExtFree() implementation is provided. The idea is borrowed from ARMTargetLowering::isZExtFree.
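
A condensed sketch of such an overload, modeled on the ARM implementation (written from memory for illustration, not copied verbatim from the patch):

// BPFISelLowering.cpp (sketch): zero extension of a narrow load is free,
// since BPF loads from memory already zero-extend into the 64-bit register.
bool BPFTargetLowering::isZExtFree(SDValue Val, EVT VT2) const {
  EVT VT1 = Val.getValueType();
  if (Val.getOpcode() == ISD::LOAD && VT1.isSimple() && VT2.isSimple()) {
    MVT MT1 = VT1.getSimpleVT().SimpleTy;
    MVT MT2 = VT2.getSimpleVT().SimpleTy;
    if ((MT1 == MVT::i8 || MT1 == MVT::i16 || MT1 == MVT::i32) &&
        (MT2 == MVT::i32 || MT2 == MVT::i64))
      return true;
  }
  return TargetLoweringBase::isZExtFree(Val, VT2);
}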

When evaluated on BPF kernel selftests and the remove_truncate_*.ll LLVM test cases, this revision performs slightly better than BPFMIPeepholeTruncElim; see the "Impact" section below for details.

The commit also adds a few test cases to make sure that the patterns in question are handled.

Long description

Why this works: Pattern #1

Consider the following example:

define i1 @foo(ptr %p) {
entry:
  %a = load i8, ptr %p, align 1
  %cond = icmp eq i8 %a, 0
  ret i1 %cond
}

Log for the llc -mcpu=v2 -mtriple=bpfel -debug-only=isel command:

...
Type-legalized selection DAG: %bb.0 'foo:entry'
SelectionDAG has 13 nodes:
  t0: ch,glue = EntryToken
          t2: i64,ch = CopyFromReg t0, Register:i64 %0
        t16: i64,ch = load<(load (s8) from %ir.p), anyext from i8> t0, t2, undef:i64
      t19: i64 = and t16, Constant:i64<255>
    t17: i64 = setcc t19, Constant:i64<0>, seteq:ch
  t11: ch,glue = CopyToReg t0, Register:i64 $r0, t17
  t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...
Replacing.1 t19: i64 = and t16, Constant:i64<255>
With: t16: i64,ch = load<(load (s8) from %ir.p), anyext from i8> t0, t2, undef:i64
 and 0 other values
...
Optimized type-legalized selection DAG: %bb.0 'foo:entry'
SelectionDAG has 11 nodes:
  t0: ch,glue = EntryToken
        t2: i64,ch = CopyFromReg t0, Register:i64 %0
      t20: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
    t17: i64 = setcc t20, Constant:i64<0>, seteq:ch
  t11: ch,glue = CopyToReg t0, Register:i64 $r0, t17
  t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...

Note:

  • Optimized type-legalized selection DAG:
    • t19 = and t16, 255 has been replaced by t16 (the load).
    • Patterns like (and (load ... i8), 255) are replaced by a zero-extending load in DAGCombiner::BackwardsPropagateMask, called from DAGCombiner::visitAND.
    • Similarly, patterns like (srl (shl ..., 56), 56) are replaced by (and ..., 255) in DAGCombiner::visitSRL (this function is huge; look for the TLI.shouldFoldConstantShiftPairToMask() call). Both rewrites are shown schematically below.
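
Schematically, the two rewrites are:

(and (load ... anyext from i8), 255)  ->  (load ... zext from i8)   ; visitAND
(srl (shl x, 56), 56)                 ->  (and x, 255)              ; visitSRL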

Why this works: Pattern #2

Consider the following example:

define i1 @foo(ptr %p) {
entry:
  %a = load i8, ptr %p, align 1
  br label %next

next:
  %cond = icmp eq i8 %a, 0
  ret i1 %cond
}

Consider the log for the same llc -mcpu=v2 -mtriple=bpfel -debug-only=isel command.
Log for first basic block:

Initial selection DAG: %bb.0 'foo:entry'
SelectionDAG has 9 nodes:
  t0: ch,glue = EntryToken
  t3: i64 = Constant<0>
        t2: i64,ch = CopyFromReg t0, Register:i64 %1
      t5: i8,ch = load<(load (s8) from %ir.p)> t0, t2, undef:i64
    t6: i64 = zero_extend t5
  t8: ch = CopyToReg t0, Register:i64 %0, t6
...
Replacing.1 t6: i64 = zero_extend t5
With: t9: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
 and 0 other values
...
Optimized lowered selection DAG: %bb.0 'foo:entry'
SelectionDAG has 7 nodes:
  t0: ch,glue = EntryToken
      t2: i64,ch = CopyFromReg t0, Register:i64 %1
    t9: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
  t8: ch = CopyToReg t0, Register:i64 %0, t9

Note:

  • Initial selection DAG:
    • %a = load ... is lowered as t6 = (zero_extend (load ...)); without the isZExtFree() overload added by this commit it would instead be lowered as t6 = (any_extend (load ...)).
    • The decision to generate zero_extend or any_extend is made in RegsForValue::getCopyToRegs, called from SelectionDAGBuilder::CopyValueToVirtualRegister (see the sketch after these notes):
      • if isZExtFree() for the load returns true, zero_extend is used;
      • otherwise, any_extend is used.
  • Optimized lowered selection DAG:
    • t6 = (zero_extend (load ...)) is replaced by t9 = load ..., zext from i8. This is done by DAGCombiner.cpp:tryToFoldExtOfLoad(), called from DAGCombiner::visitZERO_EXTEND.
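
For reference, here is a condensed sketch of the relevant logic in RegsForValue::getCopyToRegs (an approximation for illustration, not verbatim LLVM code):

// When copying a value into a wider virtual register, prefer zero_extend
// whenever the target reports that the extension is free.
ISD::NodeType ExtendKind = PreferredExtendType;
if (ExtendKind == ISD::ANY_EXTEND && TLI.isZExtFree(Val, RegisterVT))
  ExtendKind = ISD::ZERO_EXTEND;
// ... the chosen extension kind is then used to emit the copy.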

Log for second basic block:

Initial selection DAG: %bb.1 'foo:next'
SelectionDAG has 13 nodes:
  t0: ch,glue = EntryToken
            t2: i64,ch = CopyFromReg t0, Register:i64 %0
          t4: i64 = AssertZext t2, ValueType:ch:i8
        t5: i8 = truncate t4
      t8: i1 = setcc t5, Constant:i8<0>, seteq:ch
    t9: i64 = any_extend t8
  t11: ch,glue = CopyToReg t0, Register:i64 $r0, t9
  t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...
Replacing.2 t18: i64 = and t4, Constant:i64<255>
With: t4: i64 = AssertZext t2, ValueType:ch:i8
...
Type-legalized selection DAG: %bb.1 'foo:next'
SelectionDAG has 13 nodes:
  t0: ch,glue = EntryToken
          t2: i64,ch = CopyFromReg t0, Register:i64 %0
        t4: i64 = AssertZext t2, ValueType:ch:i8
      t18: i64 = and t4, Constant:i64<255>
    t16: i64 = setcc t18, Constant:i64<0>, seteq:ch
  t11: ch,glue = CopyToReg t0, Register:i64 $r0, t16
  t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...
Optimized type-legalized selection DAG: %bb.1 'foo:next'
SelectionDAG has 11 nodes:
  t0: ch,glue = EntryToken
        t2: i64,ch = CopyFromReg t0, Register:i64 %0
      t4: i64 = AssertZext t2, ValueType:ch:i8
    t16: i64 = setcc t4, Constant:i64<0>, seteq:ch
  t11: ch,glue = CopyToReg t0, Register:i64 $r0, t16
  t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...

Note:

  • Initial selection DAG:
    • Virtual register %0 is an input value for this basic block; it corresponds to the load instruction (t9) from the first basic block.
    • It is accessed within the basic block via t4 (AssertZext (CopyFromReg t0, ...)).
    • The AssertZext is generated by RegsForValue::getCopyFromRegs called from SelectionDAGBuilder::getCopyFromRegs; it is generated only when LiveOutInfo with a known number of leading zeros is present for %0.
    • Known register bits in LiveOutInfo are computed by SelectionDAG::computeKnownBits called from SelectionDAGISel::ComputeLiveOutVRegInfo.
    • computeKnownBits() generates leading-zeros information for (load ..., zext from ...) but *does not* generate leading-zeros information for (load ..., anyext from ...). This is why the isZExtFree() overload added by this commit is important (see the sketch after these notes).
  • Type-legalized selection DAG:
    • t5 = truncate t4 is replaced by t18 = and t4, 255
  • Optimized type-legalized selection DAG:
    • t18 = and t4, 255 is replaced by t4. This is done by DAGCombiner::SimplifyDemandedBits, called from DAGCombiner::visitAND, which simplifies patterns like (and (assertzext ...)).
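
A condensed sketch of the load handling inside SelectionDAG::computeKnownBits (an approximation for illustration, not verbatim LLVM code):

// Op is the SDValue being analyzed, Known accumulates the result.
if (LoadSDNode *LD = dyn_cast<LoadSDNode>(Op)) {
  if (ISD::isZEXTLoad(LD)) {
    // Bits above the memory width of a zextload are known to be zero,
    // so LiveOutInfo gets leading-zeros information for the vreg.
    Known.Zero.setBitsFrom(LD->getMemoryVT().getScalarSizeInBits());
  }
  // For an any-extending load the high bits stay unknown, so no
  // leading-zeros information is recorded and no AssertZext is emitted.
}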

Impact

This change covers all remove_truncate_*.ll test cases:

  • for -mcpu=v4 there are no changes in the generated code;
  • for -mcpu=v2 the code generated for remove_truncate_7 and remove_truncate_8 improved slightly; for other tests it is unchanged.

For remove_truncate_7:

Before this revision                 After this revision
--------------------                 -------------------
    r1 <<= 0x20                          r1 <<= 0x20
    r1 >>= 0x20                          r1 >>= 0x20
    if r1 == 0x0 goto +0x2 <LBB0_2>      if r1 == 0x0 goto +0x2 <LBB0_2>
    r1 = *(u32 *)(r2 + 0x0)              r0 = *(u32 *)(r2 + 0x0)
    goto +0x1 <LBB0_3>                   goto +0x1 <LBB0_3>
<LBB0_2>:                            <LBB0_2>:
    r1 = *(u32 *)(r2 + 0x4)              r0 = *(u32 *)(r2 + 0x4)
<LBB0_3>:                            <LBB0_3>:
    r0 = r1                              exit
    exit

For remove_truncate_8:

Before this revision                 After this revision
--------------------                 -------------------
    r2 = *(u32 *)(r1 + 0x0)              r2 = *(u32 *)(r1 + 0x0)
    r3 = r2                              r3 = r2
    r3 <<= 0x20                          r3 <<= 0x20
    r4 = r3                              r3 s>>= 0x20
    r4 s>>= 0x20
    if r4 s> 0x2 goto +0x5 <LBB0_3>      if r3 s> 0x2 goto +0x4 <LBB0_3>
    r4 = *(u32 *)(r1 + 0x4)              r3 = *(u32 *)(r1 + 0x4)
    r3 >>= 0x20
    if r3 >= r4 goto +0x2 <LBB0_3>       if r2 >= r3 goto +0x2 <LBB0_3>
    r2 += 0x2                            r2 += 0x2
    *(u32 *)(r1 + 0x0) = r2              *(u32 *)(r1 + 0x0) = r2
<LBB0_3>:                            <LBB0_3>:
    r0 = 0x3                             r0 = 0x3
    exit                                 exit

For kernel BPF selftests the statistics are as follows:

  • For -mcpu=v4: 9 out of 655 object files have differences; in all cases the total number of instructions marginally decreased (-27 instructions).
  • For -mcpu=v2: 21 out of 655 object files have differences:
    • For 19 object files the number of instructions decreased (-129 instructions in total): some redundant rX &= 0xffff operations and register-to-register assignments were removed;
    • For 2 object files the number of instructions increased (+2 instructions in each file).

Both -mcpu=v2 instruction increases reduce to the same example:

define void @foo(ptr %p) {
entry:
  %a = load i32, ptr %p, align 4
  %b = sext i32 %a to i64
  %c = icmp ult i64 1, %b
  br i1 %c, label %next, label %end

next:
  call void inttoptr (i64 62 to ptr)(i32 %a)
  br label %end

end:
  ret void
}

Note that this example uses the value loaded into %a both as sign extended (%b) and as zero extended (%a passed as a parameter). Here is the difference in the final assembly code:

Before this revision          After this revision
--------------------          -------------------
    r1 = *(u32 *)(r1 + 0)         r1 = *(u32 *)(r1 + 0)
    r1 <<= 32                     r1 <<= 32
    r1 s>>= 32                    r1 s>>= 32
    if r1 < 2 goto <LBB0_2>       if r1 < 2 goto <LBB0_2>
                                  r1 <<= 32
                                  r1 >>= 32
    call 62                       call 62
<LBB0_2>:                     <LBB0_2>:
    exit                          exit

Before this commit %a is passed to the call as a sign-extended value; after this commit %a is passed as a zero-extended value. Both are correct, since the 32-bit sub-register is the same either way.

The difference comes from the DAGCombiner's operation on the initial DAG.
Initial selection DAG before this commit:

t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
      t6: i64 = any_extend t5         <--------------------- (1)
    t8: ch = CopyToReg t0, Register:i64 %0, t6
        t9: i64 = sign_extend t5
      t12: i1 = setcc Constant:i64<1>, t9, setult:ch

Initial selection DAG after this commit:

t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
      t6: i64 = zero_extend t5        <--------------------- (2)
    t8: ch = CopyToReg t0, Register:i64 %0, t6
        t9: i64 = sign_extend t5
      t12: i1 = setcc Constant:i64<1>, t9, setult:ch

The node t9 is processed before node t6, and the load instruction is combined into a load with sign extension:

Replacing.1 t9: i64 = sign_extend t5
With: t30: i64,ch = load<(load (s32) from %ir.p), sext from i32> t0, t2, undef:i64
 and 0 other values
Replacing.1 t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
With: t31: i32 = truncate t30
 and 1 other values

This is done by DAGCombiner.cpp:tryToFoldExtOfLoad(), called from DAGCombiner::visitSIGN_EXTEND. Note that t5 is used by t6, which is any_extend in (1) and zero_extend in (2). tryToFoldExtOfLoad() rewrites such uses of t5 differently:

  • the any_extend is simply removed;
  • the zero_extend is replaced by (and t30, 0xffffffff), which is later converted to a pair of shifts. This pair of shifts survives until the end of translation (illustrated below).
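
Schematically (t31 is the truncate inserted for the remaining uses of t5):

t30: i64,ch = load ..., sext from i32
t31: i32 = truncate t30
(any_extend t31)   ->  t30                     ; high bits are don't-care
(zero_extend t31)  ->  (and t30, 0xffffffff)   ; later becomes the shl/srl pair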

Event Timeline

eddyz87 created this revision. Aug 14 2023, 7:07 AM
Herald added a project: Restricted Project. Aug 14 2023, 7:07 AM
Herald added a subscriber: hiraditya.
eddyz87 updated this revision to Diff 551319. Aug 17 2023, 4:35 PM
eddyz87 edited the summary of this revision.

Rebase, added detailed commit message.

eddyz87 updated this revision to Diff 551320. Aug 17 2023, 4:37 PM
eddyz87 edited the summary of this revision.

Commit message fixes.

eddyz87 edited the summary of this revision. Aug 17 2023, 4:44 PM

Coincidentally, this also fixes the following bug: when LLVM is compiled with LLVM_ENABLE_EXPENSIVE_CHECKS, some BPF kernel selftests fail to build with the following error:

*** Bad machine code: Illegal virtual register for instruction ***
- function:    _dissect
- basic block: %bb.9 if.end10 (0x621000e80d98)
- instruction: %24:gpr32 = MOV_rr %8:gpr32, debug-location !339; progs/bpf_flow.c:120:2
- operand 0:   %24:gpr32
Expected a GPR register, but got a GPR32 register

Using llvm-reduce it is possible to isolate the following test case:

define i1 @foo(ptr %p) {
entry:
  %short = load i16, ptr %p, align 2
  br label %next

next:
  %cond = icmp eq i16 %short, 0
  ret i1 %cond
}

Here is how this code looks before and after the BPFMIPeepholeTruncElim transformation:

  Before BPFMIPeepholeTruncElim             After BPFMIPeepholeTruncElim
  -----------------------------             ----------------------------
bb.0.entry:                               bb.0.entry:
  %1:gpr = COPY $r1                         %1:gpr = COPY $r1
  %0:gpr32 = LDH32 %1:gpr, 0                %0:gpr32 = LDH32 %1:gpr, 0

bb.1.next:                                bb.1.next:
  %2:gpr32 = AND_ri_32 %0:gpr32, 65535      %2:gpr32 = MOV_rr %0:gpr32
             ^^^^^^^^^                                 ^^^^^^
  %4:gpr32 = MOV_ri_32 1                    %4:gpr32 = MOV_ri_32 1
  JEQ_ri_32 %2:gpr32, 0, %bb.3              JEQ_ri_32 %2:gpr32, 0, %bb.3

Note the MOV_rr instruction, introduced by the transformation, used with 32-bit sub-registers; the machine verifier expects MOV_rr to operate on 64-bit GPR registers.
This happens because of the following code:

bool BPFMIPeepholeTruncElim::eliminateTruncSeq() {
  MachineInstr* ToErase = nullptr;
  bool Eliminated = false;

  for (MachineBasicBlock &MBB : *MF) {
    for (MachineInstr &MI : MBB) {
      ...
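      // Note: unconditionally uses 64-bit MOV_rr here, even when DstReg
      // and SrcReg are 32-bit (gpr32) registers -- the source of the
      // "Bad machine code" error above.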
      BuildMI(MBB, MI, MI.getDebugLoc(), TII->get(BPF::MOV_rr), DstReg)
              .addReg(SrcReg);
      ...
    }
  }

  return Eliminated;
}

The basic fix is to select between MOV_rr and MOV_rr_32 in the above snippet; a sketch is given below. This fix leads to some size increase in the object files generated for BPF selftests: out of 655 object files, 30 have more instructions; in total, 107 more instructions are generated.
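
A minimal sketch of such a fix (an assumption for illustration; not the actual committed change, since this revision removes the pass entirely):

// Pick the move opcode matching the register class of the destination.
const TargetRegisterClass *RC = MRI->getRegClass(DstReg);
unsigned MovOp = (RC == &BPF::GPR32RegClass) ? BPF::MOV_rr_32 : BPF::MOV_rr;
BuildMI(MBB, MI, MI.getDebugLoc(), TII->get(MovOp), DstReg)
    .addReg(SrcReg);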

The increase is caused by a difference in BPFMIPreEmitPeephole::eliminateRedundantMov() behavior: it removes operations of the form rA = MOV_rr rA, but does not remove wA = MOV_rr_32 wA (removing the latter would be unsafe, because the 32-bit move also zeroes the upper 32 bits of the 64-bit register). E.g. for the running example:

  Before the basic fix                      After the basic fix
  --------------------                      -------------------
bb.0.entry:                               bb.0.entry:
  $w1 = LDH32 killed $r1, 0                 $w1 = LDH32 killed $r1, 0
  $w1 = MOV_rr killed $w1   <--- (1)        $w1 = MOV_rr_32 killed $w1   <--- (2)
  $w0 = MOV_ri_32 1                         $w0 = MOV_ri_32 1
  JEQ_ri_32 killed $w1, 0, %bb.2            JEQ_ri_32 killed $w1, 0, %bb.2
bb.1.next:                                bb.1.next:
  $w0 = MOV_ri_32 0                         $w0 = MOV_ri_32 0
bb.2.next:                                bb.2.next:
  RET implicit $w0                          RET implicit $w0

eliminateRedundantMov() will remove (1) but won't remove (2).

So, all in all, I think this revision has an advantage over the basic fix.

eddyz87 published this revision for review. Aug 17 2023, 4:52 PM
eddyz87 added a reviewer: yonghong-song.

Hi Yonghong, could you please take a look?
As described here, this fixes one of the selftest compilation bugs that occur when LLVM_ENABLE_EXPENSIVE_CHECKS is on. Sorry for the long description; I tried to explain why instruction selection would always cover the cases that BPFMIPeepholeTruncElim covered.

Herald added a project: Restricted Project. Aug 17 2023, 4:52 PM
eddyz87 edited the summary of this revision. Aug 17 2023, 5:01 PM
yonghong-song accepted this revision. Aug 18 2023, 11:37 AM

Thanks, Eduard. Your change makes sense. I wish I had known about isZExtFree() for load insns much earlier, so we would not need a pass to remove redundant code...

This revision is now accepted and ready to land. Aug 18 2023, 11:37 AM
eddyz87 updated this revision to Diff 552082. Aug 21 2023, 11:04 AM
eddyz87 edited the summary of this revision.

Rebase, want to see a green CI build.