Replace BPFMIPeepholeTruncElim by adding an overload for TargetLowering::isZExtFree() aware that zero extension is free for ISD::LOAD.
Short description
The BPFMIPeepholeTruncElim handles two patterns:
Pattern #1:
%1 = LDB %0, ... %1 = LDB %0, ... %2 = AND_ri %1, 0xff -> %2 = MOV_ri %1 <-- (!)
Pattern #2:
bb.1: bb.1: %a = LDB %0, ... %a = LDB %0, ... br %bb3 br %bb3 bb.2: bb.2: %b = LDB %0, ... -> %b = LDB %0, ... br %bb3 br %bb3 bb.3: bb.3: %1 = PHI %a, %b %1 = PHI %a, %b %2 = AND_ri %1, 0xff %2 = MOV_ri %1 <-- (!)
Plus variations:
- AND_ri_32 instead of AND_ri
- SLL/SLR instead of AND_ri
- LDH, LDW, LDB32, LDH32, LDW32
Both patterns could be handled by built-in transformations at instruction selection phase if suitable isZExtFree() implementation is provided. The idea is borrowed from ARMTargetLowering::isZExtFree.
When evaluating on BPF kernel selftests and remove_truncate_*.ll LLVM test cases this revisions performs slightly better than BPFMIPeepholeTruncElim, see "Impact" section below for details.
Commit also adds a few test cases to make sure that patterns in question are handled.
Long description
Why this works: Pattern #1
Consider the following example:
define i1 @foo(ptr %p) { entry: %a = load i8, ptr %p, align 1 %cond = icmp eq i8 %a, 0 ret i1 %cond }
Log for llc -mcpu=v2 -mtriple=bpfel -debug-only=isel command:
... Type-legalized selection DAG: %bb.0 'foo:entry' SelectionDAG has 13 nodes: t0: ch,glue = EntryToken t2: i64,ch = CopyFromReg t0, Register:i64 %0 t16: i64,ch = load<(load (s8) from %ir.p), anyext from i8> t0, t2, undef:i64 t19: i64 = and t16, Constant:i64<255> t17: i64 = setcc t19, Constant:i64<0>, seteq:ch t11: ch,glue = CopyToReg t0, Register:i64 $r0, t17 t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1 ... Replacing.1 t19: i64 = and t16, Constant:i64<255> With: t16: i64,ch = load<(load (s8) from %ir.p), anyext from i8> t0, t2, undef:i64 and 0 other values ... Optimized type-legalized selection DAG: %bb.0 'foo:entry' SelectionDAG has 11 nodes: t0: ch,glue = EntryToken t2: i64,ch = CopyFromReg t0, Register:i64 %0 t20: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64 t17: i64 = setcc t20, Constant:i64<0>, seteq:ch t11: ch,glue = CopyToReg t0, Register:i64 $r0, t17 t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1 ...
Note:
- Optimized type-legalized selection DAG:
- t19 = and t16, 255 had been replaced by t16 (load).
- Patterns like (and (load ... i8), 255) are replaced by load in DAGCombiner::BackwardsPropagateMask called from DAGCombiner::visitAND.
- Similarly patterns like (shl (srl ..., 56), 56) are replaced by (and ..., 255) in DAGCombiner::visitSRL (this function is huge, look for TLI.shouldFoldConstantShiftPairToMask() call).
Why this works: Pattern #2
Consider the following example:
define i1 @foo(ptr %p) { entry: %a = load i8, ptr %p, align 1 br label %next next: %cond = icmp eq i8 %a, 0 ret i1 %cond }
Consider log for llc -mcpu=v2 -mtriple=bpfel -debug-only=isel command.
Log for first basic block:
Initial selection DAG: %bb.0 'foo:entry' SelectionDAG has 9 nodes: t0: ch,glue = EntryToken t3: i64 = Constant<0> t2: i64,ch = CopyFromReg t0, Register:i64 %1 t5: i8,ch = load<(load (s8) from %ir.p)> t0, t2, undef:i64 t6: i64 = zero_extend t5 t8: ch = CopyToReg t0, Register:i64 %0, t6 ... Replacing.1 t6: i64 = zero_extend t5 With: t9: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64 and 0 other values ... Optimized lowered selection DAG: %bb.0 'foo:entry' SelectionDAG has 7 nodes: t0: ch,glue = EntryToken t2: i64,ch = CopyFromReg t0, Register:i64 %1 t9: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64 t8: ch = CopyToReg t0, Register:i64 %0, t9
Note:
- Initial selection DAG:
- %a = load ... is lowered as t6 = (zero_extend (load ...)) w/o special isZExtFree() overload added by this commit it is instead lowered as t6 = (any_extend (load ...)).
- The decision to generate zero_extend or any_extend is done in RegsForValue::getCopyToRegs called from SelectionDAGBuilder::CopyValueToVirtualRegister:
- if isZExtFree() for load returns true zero_extend is used;
- any_extend is used otherwise.
- Optimized lowered selection DAG:
- t6 = (any_extend (load ...)) is replaced by t9 = load ..., zext from i8. This is done by DagCombiner.cpp:tryToFoldExtOfLoad() called from DAGCombiner::visitZERO_EXTEND.
Log for second basic block:
Initial selection DAG: %bb.1 'foo:next' SelectionDAG has 13 nodes: t0: ch,glue = EntryToken t2: i64,ch = CopyFromReg t0, Register:i64 %0 t4: i64 = AssertZext t2, ValueType:ch:i8 t5: i8 = truncate t4 t8: i1 = setcc t5, Constant:i8<0>, seteq:ch t9: i64 = any_extend t8 t11: ch,glue = CopyToReg t0, Register:i64 $r0, t9 t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1 ... Replacing.2 t18: i64 = and t4, Constant:i64<255> With: t4: i64 = AssertZext t2, ValueType:ch:i8 ... Type-legalized selection DAG: %bb.1 'foo:next' SelectionDAG has 13 nodes: t0: ch,glue = EntryToken t2: i64,ch = CopyFromReg t0, Register:i64 %0 t4: i64 = AssertZext t2, ValueType:ch:i8 t18: i64 = and t4, Constant:i64<255> t16: i64 = setcc t18, Constant:i64<0>, seteq:ch t11: ch,glue = CopyToReg t0, Register:i64 $r0, t16 t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1 ... Optimized type-legalized selection DAG: %bb.1 'foo:next' SelectionDAG has 11 nodes: t0: ch,glue = EntryToken t2: i64,ch = CopyFromReg t0, Register:i64 %0 t4: i64 = AssertZext t2, ValueType:ch:i8 t16: i64 = setcc t4, Constant:i64<0>, seteq:ch t11: ch,glue = CopyToReg t0, Register:i64 $r0, t16 t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1 ...
Note:
- Initial selection DAG:
- t0 is an input value for this basic block, it corresponds load instruction (t9) from the first basic block.
- It is accessed within basic block via t4 (AssertZext (CopyFromReg t0, ...)).
- The AssertZext is generated by RegsForValue::getCopyFromRegs called from SelectionDAGBuilder::getCopyFromRegs, it is generated only when LiveOutInfo with known number of leading zeros is present for t0.
- Known register bits in LiveOutInfo are computed by SelectionDAG::computeKnownBits called from SelectionDAGISel::ComputeLiveOutVRegInfo.
- computeKnownBits() generates leading zeros information for (load ..., zext from ...) but *does not* generate leading zeros information for (load ..., anyext from ...). This is why isZExtFree() added in this commit is important.
- Type-legalized selection DAG:
- t5 = truncate t4 is replaced by t18 = and t4, 255
- Optimized type-legalized selection DAG:
- t18 = and t4, 255 is replaced by t4, this is done by DAGCombiner::SimplifyDemandedBits called from DAGCombiner::visitAND, which simplifies patterns like (and (assertzext ...)).
Impact
This change covers all remove_truncate_*.ll test cases:
- for -mcpu=v4 there are no changes in the generated code;
- for -mcpu=v2 code generated for remove_truncate_7 and remove_truncate_8 improved slightly, for other tests it is unchanged.
For remove_truncate_7:
Before this revision After this revision -------------------- ------------------- r1 <<= 0x20 r1 <<= 0x20 r1 >>= 0x20 r1 >>= 0x20 if r1 == 0x0 goto +0x2 <LBB0_2> if r1 == 0x0 goto +0x2 <LBB0_2> r1 = *(u32 *)(r2 + 0x0) r0 = *(u32 *)(r2 + 0x0) goto +0x1 <LBB0_3> goto +0x1 <LBB0_3> <LBB0_2>: <LBB0_2>: r1 = *(u32 *)(r2 + 0x4) r0 = *(u32 *)(r2 + 0x4) <LBB0_3>: <LBB0_3>: r0 = r1 exit exit
For remove_truncate_8:
Before this revision After this revision -------------------- ------------------- r2 = *(u32 *)(r1 + 0x0) r2 = *(u32 *)(r1 + 0x0) r3 = r2 r3 = r2 r3 <<= 0x20 r3 <<= 0x20 r4 = r3 r3 s>>= 0x20 r4 s>>= 0x20 if r4 s> 0x2 goto +0x5 <LBB0_3> if r3 s> 0x2 goto +0x4 <LBB0_3> r4 = *(u32 *)(r1 + 0x4) r3 = *(u32 *)(r1 + 0x4) r3 >>= 0x20 if r3 >= r4 goto +0x2 <LBB0_3> if r2 >= r3 goto +0x2 <LBB0_3> r2 += 0x2 r2 += 0x2 *(u32 *)(r1 + 0x0) = r2 *(u32 *)(r1 + 0x0) = r2 <LBB0_3>: <LBB0_3>: r0 = 0x3 r0 = 0x3 exit exit
For kernel BPF selftests statistics is as follows: (-mcpu=v4):
- For -mcpu=v4: 9 out of 655 object files have differences, in all cases total number of instructions marginally decreased (-27 instructions).
- For -mcpu=v2: 9 out of 655 object files have differences:
- For 19 object files number of instruction decreased (-129 instruction in total): some redundant rX &= 0xffff and register to register assignments removed;
- For 2 object files number of instructions increased +2 instructions in each file.
Both -mcpu=v2 instruction increases could be reduced to the same example:
define void @foo(ptr %p) { entry: %a = load i32, ptr %p, align 4 %b = sext i32 %a to i64 %c = icmp ult i64 1, %b br i1 %c, label %next, label %end next: call void inttoptr (i64 62 to ptr)(i32 %a) br label %end end: ret void }
Note that this example uses value loaded to %a both as a sign extended (%b) and as zero extended (%a passed as parameter). Here is the difference in final assembly code:
Before this revision After this revision -------------------- ------------------- r1 = *(u32 *)(r1 + 0) r1 = *(u32 *)(r1 + 0) r1 <<= 32 r1 <<= 32 r1 s>>= 32 r1 s>>= 32 if r1 < 2 goto <LBB0_2> if r1 < 2 goto <LBB0_2> r1 <<= 32 r1 >>= 32 call 62 call 62 <LBB0_2>: <LBB0_2>: exit exit
Before this commit %a is passed to call as a sign extended value, after this commit %a is passed to call as a zero extended value, both are correct as 32-bit sub-register is the same.
The difference comes from DAGCombiner operation on the initial DAG.
Initial selection DAG before this commit:
t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64 t6: i64 = any_extend t5 <--------------------- (1) t8: ch = CopyToReg t0, Register:i64 %0, t6 t9: i64 = sign_extend t5 t12: i1 = setcc Constant:i64<1>, t9, setult:ch
Initial selection DAG after this commit:
t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64 t6: i64 = zero_extend t5 <--------------------- (2) t8: ch = CopyToReg t0, Register:i64 %0, t6 t9: i64 = sign_extend t5 t12: i1 = setcc Constant:i64<1>, t9, setult:ch
The node t9 is processed before node t6 and load instruction is combined to load with sign extension:
Replacing.1 t9: i64 = sign_extend t5 With: t30: i64,ch = load<(load (s32) from %ir.p), sext from i32> t0, t2, undef:i64 and 0 other values Replacing.1 t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64 With: t31: i32 = truncate t30 and 1 other values
This is done by DAGCombiner.cpp:tryToFoldExtOfLoad called from DAGCombiner::visitSIGN_EXTEND. Note that t5 is used by t6 which is any_extend in (1) and zero_extend in (2). tryToFoldExtOfLoad() rewrites such uses of t5 differently:
- any_extend is simply removed
- zero_extend is replaced by and t30, 0xffffffff, which is later converted to a pair of shifts. This pair of shifts survives till the end of translation.