Rewrite WebAssemblyFixIrreducibleControlFlow to a simpler and cleaner design, which directly computes reachability and other properties itself. This avoids previous complexity and bugs. (The new graph analyses are very similar to how the Relooper algorithm would find loop entries and so forth.) This fixes a few bugs, including where we had a false positive and thought fannkuch was irreducible when it was not, which made us much larger and slower there, and a reverse bug where we missed irreducibility. On fannkuch, we used to be 44% slower than asm2wasm and are now 4% faster.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
Sorry for the delayed reply!
What changes fixed the previous bug?
lib/Target/WebAssembly/WebAssemblyFixIrreducibleControlFlow.cpp | ||
---|---|---|
30 ↗ | (On Diff #189189) | Why is this? |
144 ↗ | (On Diff #189189) | Real nit: We can merge these into a single line: MachineBasicBlock *MBB, *Succ; |
158 ↗ | (On Diff #189189) | We used to have the description on the definitions of 'loopers' and 'non-loopers' in the comment at the beginning of the file, but not anymore. Can we have that again here maybe? |
167 ↗ | (On Diff #189189) | Does this definition cover when Entry is an entry (or header) of a loop? I'm wondering because the reachability computation here does not consider backedges to Entry, so it will not be discovered as a looper. Also does it cover this case? bb0: br bb1 bb1: br_if bb0 br bb2 bb2: br bb3 bb3: br bb2 So here bb0 and bb1 comprise a loop and bb2 and bb3 comprise another loop. And all four blocks are loopers, but bb0 and bb2 are loop entries. But they are not reachable from any of non-loopers, because there are no non-loopers in this CFG. |
201 ↗ | (On Diff #189189) | Why is this worklist a set? |
222 ↗ | (On Diff #189189) | Do we need to make this as a separate class now? Unlike the previous version, we create LoopFixer object only once in WebAssemblyFixIrreducibleControlFlow::runOnMachineFunction. I think we can add paste the body of LoopFixer::run into WebAssemblyFixIrreducibleControlFlow::runOnMachineFunction and make LoopFixer:: processRegion as a member function of WebAssemblyFixIrreducibleControlFlow that returns a bool. |
270 ↗ | (On Diff #189189) | I might be mistaken, but the algorithm seems to not care about whether loop entries found are within the same level loop. For example, the case where there are two-level nested loops and block A is one of the entries to the outer loop and B is one of the entries for the inner loop. Even in this case, we don't distinguish them and resolve these with a single pass with a single dispatch block. Is this OK and intended? And for the 'mutual entries' above, can't they also be not in the same loop but each in an outer loop and an inner loop? |
275 ↗ | (On Diff #189189) |
while (FoundIrreducibility) { ReachabilityGraph Graph(Entry, Blocks); bool FoundIrreducibility = false; for (auto *LoopEntry : Graph.getLoopEntries()) { ... } } for (auto *LoopEntry : Graph.getLoopEntries()) { ... }
|
286 ↗ | (On Diff #189189) | In which case do we need to recurse into inner loops? (Given that we didn't distinguish inner loops and outer loops above) Could you add some comments? |
303 ↗ | (On Diff #189189) | Nit: We can do in a single line BlockVector SortedEntries(Entries.begin(), Entries.end()); |
347 ↗ | (On Diff #189189) | Is there any case !Pair.second is not set? Aren't all blocks in SortedEntries distinct? |
371 ↗ | (On Diff #189189) | This loop does not seem to check if a corresponding Split block for Entry has been already created. In case Entry has more than one predecessors, Split for Entry is gonna be created multiple times and Map[Entry] will be overwritten. We should do something like this inside the loop: if (Map.count(Entry)) continue; |
378 ↗ | (On Diff #189189) | Did you mean Entry->isLayoutSuccessor(Pred) ? In the current state all new blocks are gonna be appended at the end, because this condition is hardly likely to be satisfied. |
390 ↗ | (On Diff #189189) | Nit: There is a [[ https://github.com/llvm/llvm-project/blob/a946997c2482e4386549ee38b4bb154eb58efbb6/llvm/include/llvm/CodeGen/MachineInstrBuilder.h#L405-L410 | BuildMI function ]] that defaults to append instruction at the end, so you can just do BuildMI(Split, DebugLoc(), TII.get(WebAssembly::CONST_I32), Reg) .addImm(Indices[Entry]); BuildMI(Split, DebugLoc(), TII.get(WebAssembly::BR)).addMBB(Dispatch); (Note that Split does not have * anymore) |
test/CodeGen/WebAssembly/irreducible-cfg.ll | ||
224 ↗ | (On Diff #189189) |
|
What changes fixed the previous bug?
Basically the rewrite avoids the entire previous approach of "canonicalization", of representing an inner loop by its entry. When we got that wrong, we could reach very wrong results.
lib/Target/WebAssembly/WebAssemblyFixIrreducibleControlFlow.cpp | ||
---|---|---|
30 ↗ | (On Diff #189189) | The NumLoops^2 part? To find mutual loop entries, that is, loop entries that can reach each other, we compare each one to the others. This is actually just the square of the number of loops in a single scope, so nested loops don't count. Anyhow, even in the "worst case" which is a loop after a loop after a loop, this factor is going to be quite small, I'd guess. |
167 ↗ | (On Diff #189189) | When we look at a region, the entry block cannot be a looper, which simplifies things. LLVM doesn't let a function entry be in a loop (maybe for similar reasons as why it's convenient here?), and when we recurse into a loop we ignore backedges to Entry intentionally (so we see the inner part of the loop, and not the looping). In other words, if a block is a looper, it will be seen as such in the outer scope it is in (where it is not the entry). When it is the entry into a region, and we are looking at just that region, it cannot be a looper there. For the second issue (with that example): yeah, the comment was not clear enough, I'll update it. They key is that a loop enterer is a block not in the same loop, that enters it. So here bb2 is a looper, and bb1 is a predecessor of it, and not in the same loop (bb2 can't reach bb1), so we identify bb2 as a loop entry and bb1 as a loop enterer for it. |
201 ↗ | (On Diff #189189) | As a set, it won't have duplicate entries - so if we visit A that causes us to add C to the future work, and then visit B that also causes us to add C to the future work, then when we get to C we'll visit it once, and avoid visiting it again later (and not having anything to do the second time). I haven't benchmarked this recently, but I remember it was beneficial in the Relooper. |
222 ↗ | (On Diff #189189) | Good idea, thanks! |
270 ↗ | (On Diff #189189) | Yeah, it's somewhat debatable (is an inner loop an "inner loop" if there's a branch going into it from several levels up?), but yes, when we see irreducibility we handle it now, instead of leaving it for the inner loop. We have to, I think - if we can branch to it now, we need to fix up that branch now - we wouldn't see it later when we recurse and ignore the outside. At least that would require a very different approach I suppose (which might be more optimal in some cases). |
275 ↗ | (On Diff #189189) | The problem is that the Graph is used outside of the loop in your code. I agree it's a little awkward as it is... another option is to use an std::unique_ptr to allow using it from the outside. But that's less clear I think. Yes, we might find irreducibility again if we have two irreducible loops one after the other (i.e. not nested). It's *possible* we don't need to look at the irreducible one we've already fixed, however, I don't have a proof of that. And we do need to recompute the Graph again after fixing one loop (since it can affect others; and trying to update it is fairly complex an optimization), so starting another loop iteration is the simple thing. The cost is another iteration per irreducible loop, which is not that bad since they are rare. I'll try to improve the comment here. |
286 ↗ | (On Diff #189189) | Good point, this is important to explain, adding. |
347 ↗ | (On Diff #189189) | This is code not changed in this patch (just some variable renaming), but I think you're right, fixing. |
371 ↗ | (On Diff #189189) | This is code that did not change in this patch (aside from names and clang-format), and I'm not sure I fully understand it (I think Dan wrote it?). But I don't think Map[Entry] can be overwritten? The shape of the code is DenseMap<MachineBasicBlock *, MachineBasicBlock *> Map; for (auto *Entry : Pred->successors()) { [..] Map[Entry] = Split; } |
378 ↗ | (On Diff #189189) | As before, this is Dan's code that I didn't change here (aside from variable names). I'm not sure what it does... |
test/CodeGen/WebAssembly/irreducible-cfg.ll | ||
224 ↗ | (On Diff #189189) |
|
Review feedback changes, in particular:
- Remove LoopFixer class.
- Various comment additions and improvements.
- Various nits.
Real nit: We usually start CL/commit headers with [WebAssembly]
lib/Target/WebAssembly/WebAssemblyFixIrreducibleControlFlow.cpp | ||
---|---|---|
64 ↗ | (On Diff #190328) |
|
- There are several places in the diff that says "Context not available". I think the diff has some problems. (I thought it needed rebase but maybe it's not relevant) Could you check?
- In this new algorithm, all tests in irreducible-cfg-nested.ll, irreducible-cfg-nested2.ll, and irreducible-cfg-exceptions.ll are reducible now, i.e. this pass does not do anything on them. But some of them are irreducible if we pass -O0. (I only checked func_2 in irreducible-cfg-nested2.ll) I guess we should check these tests and either delete the ones that are reducible or change the flags to llc to -O0 or something to make them irreducible again. Also test3 in irredducible-cfg.ll is now considered as reducible too.
- On second thought, wouldn't it better to just specify -O0 for all irreducibility tests? -O2 basically means we don't now what would happen to our carefully created irreducible CFG.
lib/Target/WebAssembly/WebAssemblyFixIrreducibleControlFlow.cpp | ||
---|---|---|
167 ↗ | (On Diff #189189) | Oh I see. Thanks. I didn't know that function entries cannot be loop headers. |
201 ↗ | (On Diff #189189) | But isn't it possible that by the time you try to add C for the second time, the first C has been already erased from the set so you do the unnecessary work again? Usually what I've seen was making worklist as a vector and have a separate set called something like Visited, to which visited blocks are added and not erased. |
270 ↗ | (On Diff #189189) | Is there a condition in the code that we only consider inner loops that have entries from several levels up? I don't think I found one. We just treat all blocks within the region the same way. So, for example, an outer loop consists of blocks A, B, C, D, E, and F, and among them E and F comprises an inner loop. The outer loop has two entries: A and D from the blocks outside, and the inner loop also has two entries: E and F. E's enterer is C and F's enterer is D. In this case, can we solve this case in one pass with a single dispatch block that branches into all of A, D, E, and F? I guess I don't understand the algorithm :( Sorry for that. Please correct me if I'm mistaken. |
275 ↗ | (On Diff #189189) | I see. I missed the graph computation was there too. |
371 ↗ | (On Diff #189189) | Ah right, it's not gonna be overwritten. So the code structure is for (MachineBasicBlock *Pred : AllPreds) { DenseMap<MachineBasicBlock *, MachineBasicBlock *> Map; for (auto *Entry : Pred->successors()) { ... Map[Entry] = Split; } And in case we have entries E0 and E1, and there are predecessors P0 and P1, and the edges are like P0->E0, P0->E1, P1->E0, and P1->E1. So the both preds point to the both entries. In this case, each entry (E0 or E1) is processed twice in the inner for loop. Because Map is created within the outer loop, it's not gonna be overwritten, but Split for each E0 and E1 is gonna be created twice unnecessarily. (I know it's not the part of the code that's changed in this CL; but I thought while we're writing the major part of the pass, maybe we can fix other things too. If you don't prefer that maybe I can submit that as a separate patch. Let me know which one you prefer.) |
378 ↗ | (On Diff #189189) | (Because code parts have moved, this comment is not where it was anymore, so) // This is a successor we need to rewrite. MachineBasicBlock *Split = MF.CreateMachineBasicBlock(); MF.insert(Pred->isLayoutSuccessor(Entry) ? MachineFunction::iterator(Entry) : MF.end(), Split); I think the intention of the original code was, when we insert the Split which is to serve as a stepping stone between Pred and Entry, if currently Pred is right before Entry in the BB order, we want to insert Split between them, which makes a natural order. If not, we just append the new Split at the end of the function. I think this bug was there all along for years, and this was all newly created blocks are appended at the end. I know this is not related to this CL, but if you prefer you can fix it maybe, or I can submit that as a separate patch. |
390 ↗ | (On Diff #189189) | Ping |
230 ↗ | (On Diff #190328) | Nit: Can we take out this and makeSingleEntryLoop from the class definition? I don't think we want to inline these long functions and we can have one less level of indentation outside. I mean, outside the class, bool WebAssemblyFixIrreducibleControlFlow::processRegion(...) { ... } |
test/CodeGen/WebAssembly/irreducible-cfg.ll | ||
224 ↗ | (On Diff #189189) |
; RUN: llc < %s -O0 -asm-verbose=false -verify-machineinstrs -disable-block-placement -wasm-disable-explicit-locals -wasm-keep-registers | FileCheck %s ; RUN: llc < %s -O2 -asm-verbose=false -verify-machineinstrs -disable-block-placement -wasm-disable-explicit-locals -wasm-keep-registers | FileCheck -check-prefix=OPT %s ; OPT-NOT: br_table define hidden void @ps_hints_apply() { ... } ; CHECK-NOT: br_table define hidden i32 @_Z15fannkuch_workerPv(i8* %_arg) { ... } Here is the manual for this -check-prefix option.
|
test/CodeGen/WebAssembly/non-irreducible-cfg.ll | ||
12 ↗ | (On Diff #190328) | Nit: Can we remove hidden and #0? |
lib/Target/WebAssembly/WebAssemblyFixIrreducibleControlFlow.cpp | ||
---|---|---|
270 ↗ | (On Diff #189189) | Oh, nevermind. I think I got it. :) Inner loop entries that are not reachable from outside will not be identified as loop entries when we examine the outer region, because those inner loop entries can also reach their preds, as long as the preds are not from outside and in an outer loop. |
259 ↗ | (On Diff #190328) | Nit: How about renaming this to EntriesForSingleLoop or something and start by inserting LoopEntry first? (I don't really like the name EntriesForSingleLoop, but can't think of a better alternative now... I guess there can be better ones) BlockSet EntriesForSingleLoop; EntriesForSingleLoop.insert(LoopEntry); for (auto *OtherLoopEntry : Graph.getLoopEntries()) { if (Graph.canReach(LoopEntry, OtherLoopEntry) && Graph.canReach(OtherLoopEntry, LoopEntry)) EntriesForSingleLoop.insert(OhterLoopEntry); } if (EntriesForSingleLoop.size() > 1) { makeSingleEntryLoop(EntriesForSingleLoop, Blocks, MF); ... } ... That way we don't need to std::move this later and don't need to check if (OtherLoopEntry != LoopEntry). And in my opinion if (EntriesForSingleLoop.size() > 1) makes it clear that we take a multiple entry loop and transform it into a single entry loop. This is just a matter of style preference, so if you don't like it, please ignore it. |
292 ↗ | (On Diff #190328) | we may only alter branches 'in' existing..? |
test/CodeGen/WebAssembly/irreducible-cfg.ll | ||
224 ↗ | (On Diff #189189) | On second thought, wouldn't it better to just specify -O0 for all irreducibility tests? -O2 basically means we don't now what would happen to our carefully created irreducible CFG. |
lib/Target/WebAssembly/WebAssemblyFixIrreducibleControlFlow.cpp | ||
---|---|---|
201 ↗ | (On Diff #189189) | The way you suggest is probably faster overall, I agree, I'll change it to that. |
371 ↗ | (On Diff #189189) | I'd prefer we do it separately since this patch already rewrites all the other code in this file, aside from that one unchanged function. Seems safer to change it later. |
390 ↗ | (On Diff #189189) | Same issue as before - I'd prefer to not modify this one function in this patch (which changes everything aside from that one function), for safety. (I did rename a bunch of variables, because I wanted to keep names consistent throughout the file - that seemed necessary in this patch.) |
64 ↗ | (On Diff #190328) | Rebasing - hopefully that fixes the context issue. Yeah, WebAssembly.h was included already, so might be the diff context problem. |
259 ↗ | (On Diff #190328) | I agree the move is not nice. I think we can just add LoopEntry to MutualLoopEntries and simplify things that way, updating the patch with that, let me know what you think. |
292 ↗ | (On Diff #190328) | Thanks, I'll fix that. |
test/CodeGen/WebAssembly/irreducible-cfg.ll | ||
224 ↗ | (On Diff #189189) | Removed hidden, and changed to -O0 in irreducible-cfg. Yeah, that seems like the right thing to do for irreducibility tests. I didn't want to change irreducible-cfg-exceptions.ll's optimization level though because I'm not sure that wouldn't invalidate it. |
Various changes from the feedback:
- Change how the LoopBlocks work list works, using a set+vector instead of just a set.
- Move large functions out of the class definition.
- Avoid a move in MutualEntries, instead have them always contain all the mutual entries (which includes the original entry, as it is mutual to itself).
- Test cleanup; use -O0 for the main irreducibility tests; remove old irreducibility tests which were invalid.
- Comment improvements.
- Sorry, but at least some of the tests in irreducible-cfg-nested.ll and irreducible-cfg-nested2.ll were still irreducible if we passed -O0. Do we want to delete all of them? Shouldn't we keep the ones that are still irreducible with -O0?
- Real nit: We usually start CL/commit headers with [WebAssembly]
lib/Target/WebAssembly/WebAssemblyFixIrreducibleControlFlow.cpp | ||
---|---|---|
259 ↗ | (On Diff #190328) | I think this is better, thanks. |
test/CodeGen/WebAssembly/irreducible-cfg.ll | ||
---|---|---|
288 ↗ | (On Diff #190491) | Nit: newline at the end |
Updates:
- Thanks, indeed one of the tests removed before (func_2) was still irreducible at -O0, restored it into the main irreducible file test, which is now at -O0. The other (tre_parse) was actually not irreducible, so we mis-identified it, and I left it out.
- Fixed extra newline at the end of a test.
- Renamed the title to have [WebAssembly]
LGTM, thank you. Do you want me to commit this for you? Or are you gonna get a commit access?
Great, and thanks for the comments!
I should probably get commit access, yeah :) Can you please commit this one, though, so it doesn't wait on that?