This is an archive of the discontinued LLVM Phabricator instance.

[SimplifyCFG] Avoid behavior that is quadratic in the number of predecessors in instruction sinking.
ClosedPublic

Authored by mzolotukhin on Dec 18 2017, 10:47 AM.

Details

Summary

If a block has N predecessors, then the current algorithm will try to
sink common code to this block N times (whenever we visit a
predecessor). Every attempt to sink the common code includes going
through all predecessors, so the complexity of the algorithm becomes
O(N^2).
With this patch, we try to sink common code only when we visit the block
itself, which brings the complexity down to O(N).
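
To illustrate the difference, here is a minimal standalone sketch (the Block type and the simulateOld/simulateNew helpers are hypothetical illustrations, not the actual SimplifyCFG code) that simply counts how many predecessor scans each strategy performs:

#include <cstdio>
#include <vector>

struct Block {
  std::vector<Block *> Preds; // predecessors of this block
};

// Old behavior: sinking into BB is retried every time one of its predecessors
// is visited, and each attempt walks the whole predecessor list -> O(N^2).
static unsigned simulateOld(const Block &BB) {
  unsigned Scans = 0;
  for (const Block *P : BB.Preds) { // one sinking attempt per predecessor visit
    (void)P;
    Scans += BB.Preds.size();       // each attempt re-scans all predecessors
  }
  return Scans;
}

// New behavior: sinking is attempted once, when BB itself is visited -> O(N).
static unsigned simulateNew(const Block &BB) {
  return BB.Preds.size();
}

int main() {
  std::vector<Block> Preds(1000);
  Block BB;
  for (Block &P : Preds)
    BB.Preds.push_back(&P);
  std::printf("old: %u scans, new: %u scans\n", simulateOld(BB), simulateNew(BB));
  return 0;
}

With 1000 predecessors (roughly the size of the problematic block mentioned later in this thread), the old strategy performs a million predecessor scans versus a thousand for the new one.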
As a side effect, the moment at which the code is sunk is slightly different
than before (the order of simplifications has changed), which is why I had
to adjust two tests (note that neither of these tests is supposed to test
SimplifyCFG):

  • test/CodeGen/AArch64/arm64-jumptable.ll - the changes in this test mimic the changes that the previous implementation of SimplifyCFG would make.

  • test/CodeGen/ARM/avoid-cpsr-rmw.ll - in this test I disabled common code sinking with a command-line flag.

Event Timeline

mzolotukhin created this revision. Dec 18 2017, 10:47 AM

This also fixes a big compile-time regression we found that was caused by rL279460 (and subsequent follow-ups). In the problematic test case we have a huge function where one of the blocks has more than 1000 predecessors.

test/CodeGen/AArch64/arm64-jumptable.ll - the changes in this test mimic the changes that the previous implementation of SimplifyCFG would make.

With your patch, what transform do we perform here instead of sinking? As far as I can tell, SimplifyCFG doesn't make any other changes.

lib/Transforms/Utils/SimplifyCFG.cpp
1657

*predecessors

test/CodeGen/AArch64/arm64-jumptable.ll - the changes in this test mimic the changes that the previous implementation of SimplifyCFG would make.

With your patch, what transform do we perform here instead of sinking? As far as I can tell, SimplifyCFG doesn't make any other changes.

I don't think SimplifyCFG makes any other changes, but the test isn't for SimplifyCFG - as far as I understand it, it's for instructions we generate in the backend. The test was just written to trigger some specific situation, and with this change it stopped triggering it.

Michael

I don't think SimplifyCFG makes any other changes, but the test isn't for SimplifyCFG - as far as I understand it, it's for instructions we generate in the backend. The test was just written to trigger some specific situation, and with this change it stopped triggering it.

Yes, I'm not really worried about that aspect; I'm more concerned that the transform somehow isn't triggering in cases where it should.

I just tried digging a bit; it looks like the generated code is worse, particularly for avoid-cpsr-rmw.ll. I'm not sure that's really the fault of this patch, exactly... it looks like the sinking does in fact happen, but the end result is different because of related transforms. Looks like it's exposing a bad heuristic elsewhere?

Yes, I'm not really worried about that aspect; I'm more concerned that the transform somehow isn't triggering in cases where it should.

I just tried digging a bit; it looks like the generated code is worse, particularly for avoid-cpsr-rmw.ll. I'm not sure that's really the fault of this patch, exactly... it looks like the sinking does in fact happen, but the end result is different because of related transforms. Looks like it's exposing a bad heuristic elsewhere?

Yes, it does behave differently after the patch. However, from a code-sinking point of view the new behavior seems better to me.

Here is how the behavior changed for arm64-jumptable.ll. Previously, SimplifyCFG produced:

define void @sum(i32 %a, i32* %to, i32 %c) {
entry:
  switch i32 %a, label %exit [
    i32 1, label %bb1
    i32 2, label %exit.sink.split
    i32 3, label %bb3
    i32 4, label %bb4
  ]
bb1:
  %b = add i32 %c, 1
  br label %exit.sink.split
bb3:
  br label %exit.sink.split
bb4:
  br label %exit.sink.split
exit.sink.split:
  %.sink = phi i32 [ 5, %bb4 ], [ %b, %bb1 ], [ 3, %bb3 ], [ %a, %entry ]
  store i32 %.sink, i32* %to
  br label %exit
exit:
  ret void
}

Now we're producing:

define void @sum(i32 %a, i32* %to, i32 %c) {
entry:
  switch i32 %a, label %exit [
    i32 1, label %bb1
    i32 2, label %exit.sink.split
    i32 3, label %exit.sink.split
    i32 4, label %bb4
  ]
bb1:                                              ; preds = %entry
  %b = add i32 %c, 1
  br label %exit.sink.split
bb4:                                              ; preds = %entry
  br label %exit.sink.split
exit.sink.split:                                  ; preds = %entry, %entry, %bb1, %bb4
  %.sink = phi i32 [ 5, %bb4 ], [ %b, %bb1 ], [ %a, %entry ], [ %a, %entry ]
  store i32 %.sink, i32* %to
  br label %exit
exit:                                             ; preds = %exit.sink.split, %entry
  ret void
}

So we merged the path entry->bb3->exit.sink.split into a single edge from entry to exit.sink.split. Except for helping the backend, I don't see why we would want to keep these edges separate. Helping the backend might be a sufficient reason, but I think it's generally better not to rely on this kind of thing in the backend.

In avoid-cpsr-rmw.ll we previously generated:

define void @t4(i32* nocapture %p, double* nocapture %q) #3 {
entry:
  %0 = load double, double* %q, align 4
  %cmp = fcmp olt double %0, 1.000000e+01
  %incdec.ptr1 = getelementptr inbounds i32, i32* %p, i32 1
  br i1 %cmp, label %if.then, label %if.end
if.then:                                          ; preds = %entry
  br label %if.end
if.end:                                           ; preds = %entry, %if.then
  %.sink3 = phi i32 [ 7, %if.then ], [ 3, %entry ]
  %.sink2 = phi i32 [ 2, %if.then ], [ 3, %entry ]
  %.sink1 = phi i32 [ 8, %if.then ], [ 5, %entry ]
  %.sink = phi i32 [ 9, %if.then ], [ 6, %entry ]
  store i32 %.sink3, i32* %p, align 4
  %incdec.ptr5 = getelementptr inbounds i32, i32* %p, i32 %.sink2
  store i32 %.sink1, i32* %incdec.ptr1, align 4
  store i32 %.sink, i32* %incdec.ptr5, align 4
  ret void
}

And now we generate:

define void @t4(i32* nocapture %p, double* nocapture %q) #3 {
entry:
  %0 = load double, double* %q, align 4
  %cmp = fcmp olt double %0, 1.000000e+01
  %incdec.ptr1 = getelementptr inbounds i32, i32* %p, i32 1
  %. = select i1 %cmp, i32 7, i32 3
  %.4 = select i1 %cmp, i32 2, i32 3
  %.5 = select i1 %cmp, i32 8, i32 5
  %.6 = select i1 %cmp, i32 9, i32 6
  store i32 %., i32* %p, align 4
  %incdec.ptr5 = getelementptr inbounds i32, i32* %p, i32 %.4
  store i32 %.5, i32* %incdec.ptr1, align 4
  store i32 %.6, i32* %incdec.ptr5, align 4
  ret void
}

Since we're not changing the logic for generating selects and we're ending up with them now, I assume that this is also a desirable form of IR in this case.

If that's not what we want to do in SimplifyCFG, I can investigate it further and try to fix it.

Thanks,
Michael

The change to arm64-jumptable.ll looks fine. I think something is going wrong in avoid-cpsr-rmw.ll.

Since we're not changing the logic for generating selects and we're ending up with them now, I assume that this is also a desirable form of IR in this case.

We have thresholds in various places to avoid transforming control flow into straight-line computation if it gets too expensive; they're scattered all over the place, though, so we might not handle it consistently.

hfinkel added a subscriber: Carrot. Dec 20 2017, 4:10 PM

The change to arm64-jumptable.ll looks fine. I think something is going wrong in avoid-cpsr-rmw.ll.

Since we're not changing the logic for generating selects and we're ending up with them now, I assume that this is also a desirable form of IR in this case.

We have thresholds in various places to avoid transforming control flow into straight-line computation if it gets too expensive; they're scattered all over the place, though, so we might not handle it consistently.

Exactly. Also, @Carrot has been working on this problem recently (see D39352). Our heuristics in this regard should soon get significantly smarter (although it's not clear that this will, or should, change the avoid-cpsr-rmw.ll case). @efriedma, based on the IR above, it does not seem that what's happening is wrong (the branchy code does not seem obviously superior to me, because it's not clear we're missing opportunities for speculative execution). Can you elaborate?

The question is how you materialize all the immediates. On ARM, at least, the "select" version generates two instructions per PHI node, with a sequence like "mov imm; movwmi imm; mov imm; movwmi imm", etc., while the branchy version generates a branch, where each block has one instruction per PHI node. AArch64 is similar (except that we use "csel" there). x86 is similar; we generate a "mov+mov+cmov" sequence, except for some special cases which get lowered to "lea".

So clearly there's some threshold where the select-based lowering is worse, simply because the processor is executing more instructions. Of course, the exact threshold where the branch is faster probably depends on the target, the length of the critical path, and how predictable the branch is.
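
To make the trade-off concrete, here is a rough back-of-the-envelope cost sketch (the selectCost/branchCost helpers and their instruction counts are hypothetical illustrations drawn from the counts described above, not LLVM's actual cost model):

#include <cstdio>

// Select lowering: roughly two instructions per PHI (materialize one
// immediate, then conditionally overwrite it), all executed unconditionally.
static unsigned selectCost(unsigned NumPHIs) { return 2 * NumPHIs; }

// Branchy lowering: one branch plus one immediate move per PHI on whichever
// side is actually taken.
static unsigned branchCost(unsigned NumPHIs, unsigned BranchCost = 1) {
  return BranchCost + NumPHIs;
}

int main() {
  const unsigned Counts[] = {1, 2, 4, 8};
  for (unsigned N : Counts)
    std::printf("PHIs=%u  select=%u  branch=%u\n", N, selectCost(N),
                branchCost(N));
  return 0;
}

By raw instruction count the select form starts losing once 2*N exceeds 1+N, i.e. for more than one PHI, although the real crossover also depends on the factors noted above: the target, the length of the critical path, and how predictable the branch is.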

The question is how you materialize all the immediates. On ARM, at least, the "select" version generates two instructions per PHI node, with a sequence like "mov imm; movwmi imm; mov imm; movwmi imm", etc., while the branchy version generates a branch, where each block has one instruction per PHI node. AArch64 is similar (except that we use "csel" there). x86 is similar; we generate a "mov+mov+cmov" sequence, except for some special cases which get lowered to "lea".

So clearly there's some threshold where the select-based lowering is worse, simply because the processor is executing more instructions. Of course, the exact threshold where the branch is faster probably depends on the target, the length of the critical path, and how predictable the branch is.

Good point. I don't know if we have a heuristic that deals with that specifically, but it seems like we should. Should we move forward with this patch, file a bug, and then work on this in a follow-up? Fixing the O(N^2) behavior here seems like something we should keep as separate as possible from changing the heuristics.

efriedma accepted this revision. Dec 20 2017, 4:54 PM

OK, fixing the heuristic doesn't need to block this fix. Sinking itself appears to be working as expected, and I don't think this particular case of a bunch of PHI nodes is likely to show up often in practice.

LGTM.

This revision is now accepted and ready to land. Dec 20 2017, 4:54 PM
mzolotukhin closed this revision. Dec 20 2017, 5:24 PM
mzolotukhin marked an inline comment as done.

Thanks for the review!
Phabricator didn't pick up the commit for this review thread; the patch was committed in rL321236.

Michael