This is an archive of the discontinued LLVM Phabricator instance.

[bpf] zero extension is required in BPF implementation so remove <<=32 >>=32
ClosedPublic

Authored by jrfastab on Feb 4 2020, 11:36 AM.

Details

Summary
The current pattern matching for zext results in the following code snippet
being produced,

  w1 = w0
  r1 <<= 32
  r1 >>= 32

Because BPF implementations require zero extension on 32-bit loads, this not
only adds a few unneeded instructions but also makes it harder for the
verifier to track the r1 register bounds. For example, in the verifier trace
below the R2 offset is unknown at the end of the snippet. If the bounds were
tracked correctly, w1 would have the same bounds as r8. R8's smax is less
than the U32 max value, so a zero-extending load should keep the same value.
Adding a max value of 800 (R8=inv(id=0,smax_value=800)) to an off=0, as seen
in R7, should create a max offset of 800. However, at the end of the snippet
the R2 max offset is 0xffffFFFF.

  R0=inv(id=0,smax_value=800)
  R1_w=inv(id=0,umax_value=2147483647,var_off=(0x0; 0x7fffffff))
  R6=ctx(id=0,off=0,imm=0) R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
  R8_w=inv(id=0,smax_value=800,umax_value=4294967295,var_off=(0x0; 0xffffffff))
  R9=inv800 R10=fp0 fp-8=mmmm????
 58: (1c) w9 -= w8
 59: (bc) w1 = w8
 60: (67) r1 <<= 32
 61: (77) r1 >>= 32
 62: (bf) r2 = r7
 63: (0f) r2 += r1
 64: (bf) r1 = r6
 65: (bc) w3 = w9
 66: (b7) r4 = 0
 67: (85) call bpf_get_stack#67
  R0=inv(id=0,smax_value=800)
  R1_w=ctx(id=0,off=0,imm=0)
  R2_w=map_value(id=0,off=0,ks=4,vs=1600,umax_value=4294967295,var_off=(0x0; 0xffffffff))
  R3_w=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff))
  R4_w=inv0 R6=ctx(id=0,off=0,imm=0)
  R7=map_value(id=0,off=0,ks=4,vs=1600,imm=0)
  R8_w=inv(id=0,smax_value=800,umax_value=4294967295,var_off=(0x0; 0xffffffff))
  R9_w=inv(id=0,umax_value=800,var_off=(0x0; 0x3ff))
  R10=fp0 fp-8=mmmm????

After this patch the R1 bounds are no longer smashed by the <<=32 >>=32 shifts and we
get the correct bound on R2, umax_value=800.

Further it reduces 3 insns to 1.

Diff Detail

Event Timeline

jrfastab created this revision. Feb 4 2020, 11:36 AM

We can't commit this IMO until we resolve the selftests in the kernel-side verifier; this introduces some failures. At the moment it is causing some new failures in existing test code.

> We can't commit this IMO until we resolve the selftests in the kernel-side verifier; this introduces some failures. At the moment it is causing some new failures in existing test code.

What kind of version compatibility is there between the kernel verifier and LLVM/clang? Does the latest kernel RC only work with LLVM git?

You should add a test.

@jrfastab I added an isTruncateFree() callback in the BPF IR lowering (ISel) phase. See https://reviews.llvm.org/D74101. This should help with the case where packet data/data_end is truncated with a 32-bit MOV.
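
For reference, isTruncateFree() is a TargetLowering hook that tells instruction selection that truncating a wider integer type to a narrower one costs nothing on the target. Below is a minimal sketch of such an override, modeled on how other backends implement the hook; it is illustrative only, the actual BPF change is the one in D74101.

// BPFISelLowering.cpp (sketch); the matching declaration in
// BPFISelLowering.h would carry 'override'.
bool BPFTargetLowering::isTruncateFree(Type *Ty1, Type *Ty2) const {
  // Truncating a wider integer to a narrower one is free on BPF: the
  // 32-bit subregister (wN) of a 64-bit register (rN) can be used directly.
  if (!Ty1->isIntegerTy() || !Ty2->isIntegerTy())
    return false;
  return Ty1->getIntegerBitWidth() > Ty2->getIntegerBitWidth();
}

With a hook like this in place, ISel can favor the 32-bit subregister for truncation instead of emitting extra instructions.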

@yonghong-song @ast Now that we have proper alu32 fixes on the kernel side, how about we add this? We've been running with it for a couple of months now.

ast added a comment. Apr 8 2020, 8:50 AM

> @yonghong-song @ast Now that we have proper alu32 fixes on the kernel side, how about we add this? We've been running with it for a couple of months now.

Yeah, let's land it, but please add a test first.

@jrfastab Agree with Alexei; please add a test. Thanks.

jrfastab updated this revision to Diff 265774. May 22 2020, 10:56 AM
jrfastab edited the summary of this revision.

Added zext test.

jrfastab updated this revision to Diff 265775. May 22 2020, 11:03 AM

Forgot to commit the int -> unsigned int change in the test function needed to get the expected zext instead of a sext; pushing now.

Thanks! The change looks good. There are 5 more tests that need adjustment with this optimization. Could you fix them as well?

Failing Tests (5):
  LLVM :: CodeGen/BPF/32-bit-subreg-cond-select.ll
  LLVM :: CodeGen/BPF/32-bit-subreg-peephole-phi-1.ll
  LLVM :: CodeGen/BPF/32-bit-subreg-peephole-phi-2.ll
  LLVM :: CodeGen/BPF/32-bit-subreg-peephole-phi-3.ll
  LLVM :: CodeGen/BPF/32-bit-subreg-peephole.ll
jrfastab added a comment (edited). May 25 2020, 9:22 AM

@yonghong-song I can update the tests, but I'm not completely following the CHECKs in 32-bit-subreg-peephole.ll.

; long long select_u(unsigned a, unsigned b, long long c, long long d)
; {
;   if (a > b)
;     return c;
;   else
;     return d;
; }

with corresponding checks,

; Function Attrs: norecurse nounwind readnone
define dso_local i64 @select_u(i32 %a, i32 %b, i64 %c, i64 %d) local_unnamed_addr #0 {
; CHECK-LABEL: select_u:
entry:
  %cmp = icmp ugt i32 %a, %b
  %c.d = select i1 %cmp, i64 %c, i64 %d
; CHECK: r{{[0-9]+}} <<= 32
; CHECK-NEXT: r{{[0-9]+}} >>= 32
; CHECK: if r{{[0-9]+}} {{<|>}} r{{[0-9]+}} goto
  ret i64 %c.d
}

A couple of questions, if you happen to know the answers it could save some time on my side. First, why does this not use a 32-bit compare? %a is 32 bits and %b is 32 bits, so it seems a JMP32 instruction should work fine. Here is the assembly generated,

# %bb.0:                                # %entry
        r0 = r3
        r1 <<= 32
        r1 >>= 32
        r2 <<= 32
        r2 >>= 32
        if r1 > r2 goto LBB0_2
# %bb.1:                                # %entry
        r0 = r4
LBB0_2:                                 # %entry
        exit
.Lfunc_end0:

Next, with the patch I created here we drop the zext, but it still does not use the alu32 ops. Here is the asm,

# %bb.0:                                # %entry
        r0 = r3
        if r1 > r2 goto LBB0_2
# %bb.1:                                # %entry
        r0 = r4
LBB0_2:                                 # %entry
        exit

This is probably not correct, because how do we "know" the upper bits are zero? It seems we still need to zero them. Looking at the passes before peephole optimization,

# *** IR Dump After BPF PreEmit Checking ***:
# Machine code for function select_u: NoPHIs, TracksLiveness, NoVRegs
Function Live Ins: $w1, $w2, $r3, $r4

bb.0.entry:
  successors: %bb.1(0x40000000), %bb.2(0x40000000); %bb.1(50.00%), %bb.2(50.00%)
  liveins: $r3, $r4, $w1, $w2
  $r0 = MOV_rr $r3
  $r1 = MOV_32_64 killed $w1
  $r2 = MOV_32_64 killed $w2
  JUGT_rr killed $r1, killed $r2, %bb.2

bb.1.entry:
; predecessors: %bb.0
  successors: %bb.2(0x80000000); %bb.2(100.00%)
  liveins: $r4
  $r0 = MOV_rr killed $r4

bb.2.entry:
; predecessors: %bb.0, %bb.1
  liveins: $r0
  RET implicit $r0

# End machine code for function select_u.

And then after optimization,

# *** IR Dump After BPF PreEmit Peephole Optimization ***:
# Machine code for function select_u: NoPHIs, TracksLiveness, NoVRegs
Function Live Ins: $w1, $w2, $r3, $r4

bb.0.entry:
  successors: %bb.1(0x40000000), %bb.2(0x40000000); %bb.1(50.00%), %bb.2(50.00%)
  liveins: $r3, $r4, $w1, $w2
  $r0 = MOV_rr $r3
  JUGT_rr killed $r1, killed $r2, %bb.2

bb.1.entry:
; predecessors: %bb.0
  successors: %bb.2(0x80000000); %bb.2(100.00%)
  liveins: $r4
  $r0 = MOV_rr killed $r4

bb.2.entry:
; predecessors: %bb.0, %bb.1
  liveins: $r0
  RET implicit $r0

# End machine code for function select_u.

It seems something went wrong here. I'll look into it tomorrow, but it's not what I expected in two respects: first, I expected a jmp32 op, and second, that last optimization seems wrong.

jrfastab added a comment (edited). May 25 2020, 9:25 AM

To partially answer one of the questions: it seems jmp32 is only used with -mcpu=v3, so that explains its absence above, but it still seems like -mcpu=v2 -mattr=+alu32 needs to be fixed to work with my patch.

> It seems something went wrong here. I'll look into it tomorrow, but it's not what I expected in two respects: first, I expected a jmp32 op, and second, that last optimization seems wrong.

I checked the code. I agree with you that the optimization doesn't seem right.
In BPFMIPeephole.cpp, we have

// Eliminate identical move:
//
//   MOV rA, rA
//
// This is particularly possible to happen when sub-register support
// enabled. The special type cast insn MOV_32_64 involves different
// register class on src (i32) and dst (i64), RA could generate useless
// instruction due to this.
unsigned Opcode = MI.getOpcode();
if (Opcode == BPF::MOV_32_64 ||
    Opcode == BPF::MOV_rr || Opcode == BPF::MOV_rr_32) {
  Register dst = MI.getOperand(0).getReg();
  Register src = MI.getOperand(1).getReg();

  if (Opcode == BPF::MOV_32_64)
    dst = TRI->getSubReg(dst, BPF::sub_32);

  if (dst != src)
    continue;

  ToErase = &MI;
  RedundantMovElemNum++;
  Eliminated = true;
}

MOV_32_64 is actually problematic here: it is not a simple register copy, it has a side effect.
The same goes for MOV_rr_32; it also has the side effect of zeroing out the top bits. We cannot
simply remove them.

What we need to do is further trace the def/use chain; unless the 32-bit reg already has its
upper bits zeroed, we should not do the optimization.

Will work on a fix ASAP. Thanks for spotting the problem!
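
To illustrate the kind of def/use check that would be needed, here is a purely hypothetical sketch for the pre-emit peephole. The helper name and exact condition are illustrative only; the fix actually folded into this patch simply stops eliminating these opcodes, as the diff in the next comment shows.

// Hypothetical helper for BPFMIPreEmitPeephole (sketch only): before erasing
// a MOV_32_64 whose destination sub-register equals its source, check that
// the 32-bit source was last defined by an instruction that already zeroed
// the upper 32 bits of the full register.
static bool upperBitsAlreadyZero(const MachineInstr &MI, Register Src32,
                                 const TargetRegisterInfo *TRI) {
  const MachineBasicBlock &MBB = *MI.getParent();
  const MachineInstr *LastDef = nullptr;
  for (const MachineInstr &I : MBB) {
    if (&I == &MI)
      break;                        // only consider instructions above MI
    if (I.modifiesRegister(Src32, TRI))
      LastDef = &I;                 // remember the most recent def of Src32
  }
  if (!LastDef)
    return false;                   // no def in this block: stay conservative
  unsigned Op = LastDef->getOpcode();
  // On BPF, both of these already clear the upper 32 bits, so a following
  // MOV_32_64 of the same register would be a no-op.
  return Op == BPF::MOV_32_64 || Op == BPF::MOV_rr_32;
}

Only when such a check returns true would it be safe to erase the copy; otherwise the MOV_32_64 is doing real work (zeroing the top bits) and must stay.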

yonghong-song added a comment (edited). May 26 2020, 12:19 AM

I think the following are the steps we should take:

  1. We need to fold the following change into this commit:
diff --git a/llvm/lib/Target/BPF/BPFMIPeephole.cpp b/llvm/lib/Target/BPF/BPFMIPeephole.cpp 
index a2ceade6680..fe955fad042 100644 
--- a/llvm/lib/Target/BPF/BPFMIPeephole.cpp 
+++ b/llvm/lib/Target/BPF/BPFMIPeephole.cpp 
@@ -301,19 +301,16 @@ bool BPFMIPreEmitPeephole::eliminateRedundantMov(void) { 
       // 
       //   MOV rA, rA 
       // 
-      // This is particularly possible to happen when sub-register support 
-      // enabled. The special type cast insn MOV_32_64 involves different 
-      // register class on src (i32) and dst (i64), RA could generate useless 
-      // instruction due to this. 
+      // Note that we cannot remove 
+      //   MOV_32_64  rA, wA 
+      //   MOV_rr_32  wA, wA 
+      // as these two instructions having side effects, zeroing out 
+      // top 32 bits of rA. 
       unsigned Opcode = MI.getOpcode(); 
-      if (Opcode == BPF::MOV_32_64 || 
-          Opcode == BPF::MOV_rr || Opcode == BPF::MOV_rr_32) { 
+      if (Opcode == BPF::MOV_rr) { 
         Register dst = MI.getOperand(0).getReg(); 
         Register src = MI.getOperand(1).getReg(); 
  
-        if (Opcode == BPF::MOV_32_64) 
-          dst = TRI->getSubReg(dst, BPF::sub_32); 
- 
         if (dst != src) 
           continue;

We did not have this problem before since MOV_32_64 was always generated together with shifts, and that was taken care of by the SSA peephole optimization.

But since this patch may generate MOV_32_64 without shifts, we need to make this change.

@jrfastab could you take care of this?

  2. Optionally, we need to enhance the SSA peephole optimization for the case of MOV_32_64 without shifts. This is only one insn since there are no shifts. We can do it later if we do not want to do it in this patch.

@jrfastab you can take a look at BPFMIPeephole.cpp to see whether you want to do the optimization now or delay it until later.

@yonghong-song I can merge the fix with this patch, but why do we eliminate MOV_rr? I'm trying to see where/why this case would happen; I'm not seeing other backends with similar logic. Could we just remove the entire block for all cases?

> @yonghong-song I can merge the fix with this patch, but why do we eliminate MOV_rr? I'm trying to see where/why this case would happen; I'm not seeing other backends with similar logic. Could we just remove the entire block for all cases?

No particular reason. Just want to be cautious. I agree that MOV_rr is really unlikely to be generated by the compiler, but with 32-bit subregisters I'm not 100% sure. Could you check with the kernel selftest BPF programs? If any program hits MOV_rr, I suggest keeping it. Otherwise, we can remove it.

jrfastab updated this revision to Diff 266368. May 26 2020, 4:09 PM

Update BPFMIPeephole.cpp to remove the incorrect MOV_32_64 and MOV_rr_32 instruction elimination,
which should not be done because these insns have side effects. I left the MOV_rr case in
place for now but may remove it later, after doing some more digging to see whether this case
ever pops up in our code.

Also updated the tests to catch the zext now using MOV_* instructions. I converted the CHECK <<=32 >>=32
cases to CHECK-NOT to ensure we no longer generate these lines.

yonghong-song accepted this revision. May 26 2020, 5:48 PM

LGTM. Thanks! @jrfastab Do you have "git push" permission? If yes, you can directly push the change to trunk. Otherwise, I can help push the patch in. Let me know.

This revision is now accepted and ready to land. May 26 2020, 5:48 PM

@yonghong-song No, I don't have "git push" permission. I guess I could request it if it would be helpful, but for now it might be best if you could push for me. Thanks.

Sounds good. I will git push the patch tomorrow.

This revision was automatically updated to reflect the committed changes.