This is an archive of the discontinued LLVM Phabricator instance.

[WebAssembly] Eliminate range checks on br_tables
ClosedPublic

Authored by tlively on May 30 2020, 12:49 AM.

Details

Summary

Jump tables for most targets cannot handle out of range indices by
themselves, so LLVM emits range checks to guard the jump
tables. WebAssembly, on the other hand, implements jump tables using
the br_table instruction, which takes a default branch target as an
operand, making the range checks redundant. This patch introduces a
new MachineFunction pass in the WebAssembly backend to find and
eliminate the redundant range checks.

Diff Detail

Event Timeline

tlively created this revision.May 30 2020, 12:49 AM
tlively updated this revision to Diff 267446.May 30 2020, 1:29 AM
  • Update documentation and handle unreachable target blocks
Harbormaster completed remote builds in B58535: Diff 267444.
sbc100 added inline comments.May 30 2020, 10:27 AM
llvm/lib/Target/WebAssembly/WebAssemblyFixBrTableDefaults.cpp
120

Lies!

llvm/test/CodeGen/WebAssembly/cfg-stackify.ll
652–654

Can this be CHECK-NEXT too now? So we know the br_if is not there?

llvm/test/CodeGen/WebAssembly/switch-unreachable-default.ll
8

So if the default label contains unreachable then we end up in some special case where that target doesn't actually exist in the final output, and we instead just arbitrarily jump to any other the labels?

Is this standard llvm behaviour? Is is that default target in this case eliminated by some earlier pass?

I didn't know about this optimization opportunity! Please make sure to run emscripten tests before landing because it contains some new assumptions.

llvm/lib/Target/WebAssembly/WebAssemblyFixBrTableDefaults.cpp
46

Nit: It might help readers if there's a comment that prior to this function, br_table is missing the last (= default) operand, which will be added here

48

Nit: This might be more intuitive:

assert(MBB->pred_size() == 1);

By the way, is it guaranteed that we always have a single predecessor that is a guard BB, which ends with a conditional branch when the default BB is reachable and no branch when it is unreachable? Are other patterns possible?

60

Is this guaranteed that br_table BB is always a fallthrough, so the operand 0 is the default target? Can they be swapped?

65

Would there be performance implications based on what we choose arbitrarily for the default target, in case there's no guard hints? (I don't know and I'm just asking) If we are unsure, can it be safer to choose the first one, as we are doing now?

76

Why is this process (checking if Succ is a successor of HeaderMBB and deleting it from the successors) necessary?

llvm/test/CodeGen/WebAssembly/cfg-stackify.ll
652–654

Or we can explicitly check if it's not there, like

CHECK-NOT: br_if ...
aheejin added inline comments.Jun 1 2020, 7:06 AM
llvm/lib/Target/WebAssembly/WebAssemblyFixBrTableDefaults.cpp
60

If it is somehow guaranteed, it is fine, but if we are unsure, we can compare br_if's argument to MBB to figure it out.

tlively updated this revision to Diff 267723.Jun 1 2020, 1:50 PM
tlively marked 8 inline comments as done.
  • Address comments
tlively added inline comments.Jun 1 2020, 5:31 PM
llvm/lib/Target/WebAssembly/WebAssemblyFixBrTableDefaults.cpp
46

Done.

48

Done.

There is always a single guard BB that contains a conditional branch to the default target if the default target exists. Then the guard BB either falls through or has an unconditional jump to the jump table BB.

60

They would only be swapped if something in between SelectionDAG building and this pass swapped them, but I'll add an assert in case some later change swaps the order.

65

I suppose it might affect register allocation or something, but I'm not sure. I'll switch to using the first target.

76

We need to remove the shared successors first because otherwise transferSuccessorsAndUpdatePHIs could introduce duplicate successors, which is invalid. I'll add a comment to that effect.

120

Oops!

llvm/test/CodeGen/WebAssembly/cfg-stackify.ll
652–654

Used the CHECK-NOT because IIRC there's a bunch of extra junk in there preventing us from using CHECK-NEXT.

llvm/test/CodeGen/WebAssembly/switch-unreachable-default.ll
8

Correct. The target-independent code doesn't emit any jump to a default target to begin with when the default is unreachable, so on other targets if the default did turn out to be reachable at runtime, then there would be an out-of-bounds access to the jump table.

The relevant code that creates the jump to the default block is here: https://github.com/llvm/llvm-project/blob/f027cfa37e6757bb2d17ac3bea944df4e06bcff4/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp#L2475-L2482.

aheejin accepted this revision.Jun 1 2020, 9:57 PM
This revision is now accepted and ready to land.Jun 1 2020, 9:57 PM
This revision was automatically updated to reflect the committed changes.
kcc added a subscriber: kcc.Jun 2 2020, 7:25 PM

Hi, 
This made our ubsan bots red. Please fix or revert ASAP
http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux-fast/builds/42256

In D80863#2070105, @kcc wrote:

Hi, 
This made our ubsan bots red. Please fix or revert ASAP
http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux-fast/builds/42256

Same for our downstream bots. When we build llc with gcc 7.4 we get crashes when running the following tests:

08:01:10 Failing Tests (5):
08:01:10 LLVM :: CodeGen/WebAssembly/cfg-stackify.ll
08:01:10 LLVM :: CodeGen/WebAssembly/indirectbr.ll
08:01:10 LLVM :: CodeGen/WebAssembly/stack-insts.ll
08:01:10 LLVM :: CodeGen/WebAssembly/switch-unreachable-default.ll
08:01:10 LLVM :: CodeGen/WebAssembly/switch.ll

For some reason I have not been able to reproduce these failure locally. I am reverting the change pending further investigation. @kcc what compiler version is your bot using?

I was testing the performance of a program with a big switch statement in a loop, a very common pattern in C/C++ and I came across the problem with needless range checks with br_table that this patch is fixing (Sweet! :-))
However, testing the latest version of Emscripten, with this fix, I am finding that Clang now emits "worse" bytecode for my small test program (attached{F12106893}, compiled with emcc -O3 micro-interp.c -o micro-interp.js).
The code is like:

enum Op {
    A = 0, B, C, D, E, F, G, H, I, J, K, L
};
int f(Op* ops, int len) {
    int result = 0;
    for (int i = 0; i < len; i++) {
        Op op = ops[i];
        switch (op) {
            case A: { ... break; }
            case B: { ... break; }
            case C: { ... break; }
		...
            default: { ... break; }
        }
    }
    return result;
}

Before this change, the function was compiled into this Wasm code, with a single br_table that was not using its default label (see WAT file attached:

):

(func (;11;) (type 2) (param i32)
  (local i32 i32 i32 i32 i32 i64)
  ...
  block  ;; label = @1
    block  ;; label = @2
      loop  ;; label = @3
        local.get 4
        i32.load16_u
        local.tee 3
        i32.const 11
        i32.gt_u
        br_if 1 (;@2;)
        block  ;; label = @4
          block  ;; label = @5
            block  ;; label = @6
              block  ;; label = @7
                block  ;; label = @8
                  block  ;; label = @9
                    block  ;; label = @10
                      block  ;; label = @11
                        block  ;; label = @12
                          block  ;; label = @13
                            block  ;; label = @14
                              local.get 3
                              br_table 10 (;@4;) 0 (;@14;) 1 (;@13;) 2 (;@12;) 3 (;@11;) 4 (;@10;) 5 (;@9;) 6 (;@8;) 7 (;@7;) 8 (;@6;) 13 (;@1;) 9 (;@5;) 10 (;@4;)
                            end
                            // case 1
                          end
                          // case 2
                        end
                      ...

The x64 code jitted by V8 for the switch was reasonably compact, though not-optimal:

00000000BD10FA43    83  460fb73423     movzxwl r14,[rbx+r12*1]
00000000BD10FA48    88  4183fe0b       cmpl r14,0xB 
00000000BD10FA4C    8c  0f8741020000   ja 00000000BD10FC93  // jmp to default case
00000000BD10FA52    92  4183ee0        subl r14,0x1
00000000BD10FA56    96  458bf6         movl r14,r14
00000000BD10FA59    99  4183fe0b       cmpl r14,0xB         
00000000BD10FA5D    9d  0f830d000000   jnc 00000000BD10FA70 // jmp to br_table default label
00000000BD10FA63    a3  4c8d1556030000 leaq r10,[rip+0x356]
00000000BD10FA6A    aa  43ff24f2       jmp [r10+r14*8]      // br_table jump

There were two checks (cmp/jmp), the first for the switch default case, the second for the implementation of br_table.

But now with this change I see this code being generated (see WAT file attached:

)

(func (;6;) (type 6) (param i32)
  (local i32 i32 i32 i32 i64 i32)
  ...
  block  ;; label = @1
    block  ;; label = @2
      block  ;; label = @3
        block  ;; label = @4
          block  ;; label = @5
            block  ;; label = @6
              block  ;; label = @7
                block  ;; label = @8
                  block  ;; label = @9
                    block  ;; label = @10
                      block  ;; label = @11
                        block  ;; label = @12
                          block  ;; label = @13
                            block  ;; label = @14
                              local.get 4
                              i32.load16_u
                              br_table 1 (;@13;) 0 (;@14;) 2 (;@12;) 3 (;@11;) 4 (;@10;) 5 (;@9;) 6 (;@8;) 7 (;@7;) 8 (;@6;) 9 (;@5;) 13 (;@1;) 10 (;@4;) 12 (;@2;)
                            end
                            i32.const 0
                            local.set 3
                            br 10 (;@3;)
                          end
                          i32.const 10
                          local.set 3
                          br 9 (;@3;)
                        end
                        i32.const 1
                        local.set 3
                        br 8 (;@3;)
                      end
                      i32.const 2
                      local.set 3
                      br 7 (;@3;)
                    end
                    i32.const 3
                    local.set 3
                    br 6 (;@3;)
                  end
                  i32.const 4
                  local.set 3
                  br 5 (;@3;)
                end
                i32.const 5
                local.set 3
                br 4 (;@3;)
              end
              i32.const 6
              local.set 3
              br 3 (;@3;)
            end
            i32.const 7
            local.set 3
            br 2 (;@3;)
          end
          i32.const 8
          local.set 3
          br 1 (;@3;)
        end
        i32.const 9
        local.set 3
      end
      loop  ;; label = @3
        block  ;; label = @4
          block  ;; label = @5
            block  ;; label = @6
              block  ;; label = @7
                block  ;; label = @8
                  block  ;; label = @9
                    block  ;; label = @10
                      block  ;; label = @11
                        block  ;; label = @12
                          block  ;; label = @13
                            block  ;; label = @14
                              block  ;; label = @15
                                block  ;; label = @16
                                  block  ;; label = @17
                                    block  ;; label = @18
                                      block  ;; label = @19
                                        block  ;; label = @20
                                          block  ;; label = @21
                                            block  ;; label = @22
                                              block  ;; label = @23
                                                block  ;; label = @24
                                                  block  ;; label = @25
                                                    local.get 3
                                                    br_table 0 (;@25;) 1 (;@24;) 2 (;@23;) 3 (;@22;) 4 (;@21;) 5 (;@20;) 6 (;@19;) 7 (;@18;) 8 (;@17;) 9 (;@16;) 10 (;@15;) 10 (;@15;)
                                                  end
                                                  // ... case 0
                                                  ...
                                                  br_table 19 (;@5;) 20 (;@4;) 10 (;@14;) 11 (;@13;) 12 (;@12;) 13 (;@11;) 14 (;@10;) 15 (;@9;) 16 (;@8;) 17 (;@7;) 23 (;@1;) 18 (;@6;) 22 (;@2;)
                                                end
                                                // ... case 1
                                                br_table 18 (;@5;) 19 (;@4;) 9 (;@14;) 10 (;@13;) 11 (;@12;) 12 (;@11;) 13 (;@10;) 14 (;@9;) 15 (;@8;) 16 (;@7;) 22 (;@1;) 17 (;@6;) 21 (;@2;)
                                              end
                                            ...

There is a br_table that causes an indirect jump to a stub like i32.const X local.set 3 br Y that jumps to the actual code for the case branches, which all end with other strange br_tables.
Obviously also the native code jitted is much more convoluted. The result is that my small benchmark is 34% slower (33.9 sec vs 25.2).

I see that this issue disappears if I comment away addPass(createWebAssemblyFixBrTableDefaults()); in WebAssemblyPassConfig::addInstSelector() and reintroduce Ops.push_back(DAG.getBasicBlock(MBBs[0])) in WebAssemblyTargetLowering::LowerBR_JT().

Maybe there is something wrong with my configuration, in my machine, but I double-tested this reinstalling the latest Emscripten, and can reliably reproduce the problem.
Could you take a look?

@paolosev Hmm, I suspect that there is a bad interaction between the code with the range check removed and the algorithm that lays out and structures the basic blocks. I will take a closer look!

@paolosev Hmm, I suspect that there is a bad interaction between the code with the range check removed and the algorithm that lays out and structures the basic blocks. I will take a closer look!

Also notice, if it can help, that this problem happens when compiling with -O3 but not with -Os.