This is an archive of the discontinued LLVM Phabricator instance.

Handle simple diamond CFG hoisting in DivRemPairs.
ClosedPublic

Authored by resistor on Dec 23 2022, 8:26 PM.

Details

Summary

Previous we only handled triangle CFGs. This patch expands that
to support diamonds, where the div and rem appear in the then/else
sides of a condition. In that case, we can hoist the div into the
shared predecessor.

This could be generalized further to use nearest common ancestors,
but some of the conditions for hoisting would then require
post-dominator information.

Diff Detail

Event Timeline

resistor created this revision.Dec 23 2022, 8:26 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 23 2022, 8:26 PM
resistor requested review of this revision.Dec 23 2022, 8:26 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 23 2022, 8:26 PM
lebedev.ri accepted this revision.Dec 24 2022, 8:53 AM
lebedev.ri added a subscriber: lebedev.ri.

LG

llvm/lib/Transforms/Scalar/DivRemPairs.cpp
278
This revision is now accepted and ready to land.Dec 24 2022, 8:53 AM
barannikov88 added a subscriber: barannikov88.EditedDec 24 2022, 9:36 AM

The changes in the tests actually show that this transformation is unprofitable.
Unlike the case above (where Div postdominates both PredBB and Rem), hosting the Div increases may increase critical path length.

nikic added a comment.Dec 24 2022, 9:39 AM

Hm, could you provide some more information on what the motivation for this transform is? If we don't have a domination relationship, then we won't be able to reuse a result we have to compute anyway, so is the intention here a code size improvement for the case where a divrem op exists, because we now only need to materialize the instruction once? In that case though, does this transform make sense if we are going to use expanded form, as in the PowerPC example? We are replacing a div/rem on disjoint code paths with a common div plus mul+sub on one code path, which seems strictly worse?

By hoisting div/rem that would be executed anyway,
we allow it to be issued earlier than it would have,
and assuming that the inputs were already ready,
the instructions that use it's result may have to wait
less for said result to be available.

The changes in the tests actually show that this transformation is unprofitable.
Unlike the case above (where Div postdominates both PredBB and Rem), hosting the Div increases may increase critical path length.

Div/rem are usually *extremely* slow, so that code was in a bad spot already.

Hm, could you provide some more information on what the motivation for this transform is?

If we don't have a domination relationship, then we won't be able to reuse a result we have to compute anyway

I don't follow at all. We clearly dominate both of the source blocks?

so is the intention here a code size improvement for the case where a divrem op exists,
because we now only need to materialize the instruction once? In that case though,
does this transform make sense if we are going to use expanded form, as in the PowerPC example?
We are replacing a div/rem on disjoint code paths with a common div plus mul+sub on one code path, which seems strictly worse?

Div/rem are usually *extremely* slow, so that code was in a bad spot already.

And we're going to make it twice as bad by hoisting the Div that might not be executed at all in the original case.
I.e. in the original case we execute either div or rem, but not both. After the transformation we execute div unconditionally, and additionally mul+sub in one of the cases.

I could agree that the latency issue may prevail in the case when rem is guaranteed to be expanded into div+mul+sub (and not into a libcall, for example).
Otherwise we're only going to increase the latency on the rem path.
However, this case should already be handled by MachineCSE, which has more context information such as register pressure.

May i suggest that reviewers first familiarize themselves
with the code they are reviewing, in particular with the
preconditions on the transformations,
in particular with the hasDivRemOp() TTI hook?
Hint: https://godbolt.org/z/K3dcxqEhE

craig.topper added inline comments.
llvm/test/Transforms/DivRemPairs/X86/div-expanded-rem-pair.ll
181

Isn't this the same as T3? Doesn't entry dominate this block?

lebedev.ri added inline comments.Dec 24 2022, 12:26 PM
llvm/test/Transforms/DivRemPairs/X86/div-expanded-rem-pair.ll
181

We only deal with a single div-rem pair (in the sense,
we don't deal with the case of multiple identical div's or rem's),
and we don't CSE div's/rem's, and only do a single pass over the function,
so this indeed falls through cracks.

May i suggest that reviewers first familiarize themselves
with the code they are reviewing, in particular with the
preconditions on the transformations,
in particular with the hasDivRemOp() TTI hook?
Hint: https://godbolt.org/z/K3dcxqEhE

The preconditions say:

// If the target supports div+rem and the instructions are in the same block
// already, there's nothing to do. The backend should handle this. If the
// target does not support div+rem, then we will decompose the rem.
if (HasDivRemOp && RemInst->getParent() == DivInst->getParent())
  continue;

So your example is irrelevant.

The relevant example is: https://godbolt.org/z/roGW5G54f
As you can see, in the case 'before' there are 3 instructions on each path.
In the case 'after' (which corresponds to the suggested transformation), the 'div' path contains 4 instructions and the 'rem' path contains 6 instructions.

I'm guessing what other reviewers are trying to say,
is that we might be missing something to to effect of:

if(DivBB != PredBB && RemBB != PredBB && !HasDivRemOp)
  continue; // Don't hoist both into predecessor if we don't have divrem instruction.

But it's not obvious to me if that it's actually worse,
because instruction counting is not really the right way
to gauge code performance.

Div/Rem is usually *really* that bad. We move really really
costly instruction from both branches to their predecessor

  1. if branch is mispredicted, we won't have to recompute it
  2. if the inputs are already avaliable, it can start executing earlier than it would have in the branches, thus we hide some of it's latency
  3. in turn, instructions that depend on results of those instructions may get results earlier, and start executing earlier

I'm guessing what other reviewers are trying to say,
is that we might be missing something to to effect of:

if(DivBB != PredBB && RemBB != PredBB && !HasDivRemOp)
  continue; // Don't hoist both into predecessor if we don't have divrem instruction.

But it's not obvious to me if that it's actually worse,
because instruction counting is not really the right way
to gauge code performance.

Div/Rem is usually *really* that bad. We move really really
costly instruction from both branches to their predecessor

  1. if branch is mispredicted, we won't have to recompute it
  2. if the inputs are already avaliable, it can start executing earlier than it would have in the branches, thus we hide some of it's latency
  3. in turn, instructions that depend on results of those instructions may get results earlier, and start executing earlier

Continuing with Sergei's example, the transformation *is* profitable from a code size and latency perspective on x86: https://godbolt.org/z/xo7P1MnGr
It seems like the simplest path forward here is to conditionalize the transformation as you suggest. I will look into this.

Hm, could you provide some more information on what the motivation for this transform is? If we don't have a domination relationship, then we won't be able to reuse a result we have to compute anyway, so is the intention here a code size improvement for the case where a divrem op exists, because we now only need to materialize the instruction once? In that case though, does this transform make sense if we are going to use expanded form, as in the PowerPC example? We are replacing a div/rem on disjoint code paths with a common div plus mul+sub on one code path, which seems strictly worse?

Note that there is actually a domination requirement, just not tested via dominator tree. See this example on x86 that shows where this is profitable for code size: https://godbolt.org/z/xo7P1MnGr

if(DivBB != PredBB && RemBB != PredBB && !HasDivRemOp)
  continue; // Don't hoist both into predecessor if we don't have divrem instruction.

That sounds fine to me, too.
@resistor
Could you please "port" PPC's @no_domination test to a couple of more targets to make sure there are no regressions? E.g. AArch64 (no rem), RISCV (rem, but no divrem) and MSP430 (libcall rem, no divrem)?

resistor updated this revision to Diff 485215.Dec 24 2022, 8:40 PM

Conditionalize the transformation on HasDivRemOp and add testcases.

Thanks for the update. I have two more minor suggestions:

  1. I believe we are missing negative tests for the case where either a) the div or rem block don't have a unique predecessor or b) the common predecessor has other successors. (Please let me know if I just missed them, there's a lot of test files...)
  1. I would suggest a code comment on why we are doing this transform, because the motivation here is somewhat different than for all the other cases this pass handles (in other cases, we end up not executing a div/rem operation on some code path, while here we have a scheduling / code size optimization).

In addition to that, I believe this transform has multiple pre-existing correctness issues. These are kind of orthogonal to your patch, but I noticed them while reviewing the code:

  1. If the terminator of the predecessor is not guaranteed to transfer, then we might hoist a dynamically dead div/rem, causing undefined behavior. In particular, this can happen with an invoke of a non-willreturn function.
  1. If the predecessor is a catchswitch pad, it is not legal to hoist into it.
  1. With the pending changes to callbr semantics (https://reviews.llvm.org/D135997), if callbr defines one of the div/rem operands, it may not be available after hoisting. This is not a problem yet though, so this is mostly something for @nickdesaulniers to keep an eye on. This might be a problem in other transforms as well.
resistor updated this revision to Diff 485265.Dec 25 2022, 8:00 PM

Add negative tests. Enhance comments to explain the profitability heuristic.
Attempt to fix pre-existing correctness concerns WRT non-transferring predecessors.

nikic added a comment.Dec 27 2022, 6:06 AM

Can you please add these additional tests for the invoke/catchswitch cases?

declare void @dummy()

define i32 @invoke_not_willreturn(i32 %a, i32 %b) personality ptr null {
; CHECK-LABEL: @invoke_not_willreturn(
; CHECK-NEXT:  entry:
; CHECK-NEXT:    invoke void @dummy()
; CHECK-NEXT:    to label [[CONT:%.*]] unwind label [[LPAD:%.*]]
; CHECK:       cont:
; CHECK-NEXT:    [[DIV:%.*]] = sdiv i32 [[A:%.*]], [[B:%.*]]
; CHECK-NEXT:    ret i32 [[DIV]]
; CHECK:       lpad:
; CHECK-NEXT:    [[TMP0:%.*]] = landingpad { ptr, i32 }
; CHECK-NEXT:    cleanup
; CHECK-NEXT:    [[REM:%.*]] = srem i32 [[A]], [[B]]
; CHECK-NEXT:    ret i32 [[REM]]
;
entry:
  invoke void @dummy()
  to label %cont unwind label %lpad

cont:
  %div = sdiv i32 %a, %b
  ret i32 %div

lpad:
  landingpad { ptr, i32 }
  cleanup
  %rem = srem i32 %a, %b
  ret i32 %rem
}

define i32 @invoke_willreturn(i32 %a, i32 %b) personality ptr null {
; CHECK-LABEL: @invoke_willreturn(
; CHECK-NEXT:  entry:
; CHECK-NEXT:    [[DIV:%.*]] = sdiv i32 [[A:%.*]], [[B:%.*]]
; CHECK-NEXT:    [[REM:%.*]] = srem i32 [[A]], [[B]]
; CHECK-NEXT:    invoke void @dummy() #[[ATTR0:[0-9]+]]
; CHECK-NEXT:    to label [[CONT:%.*]] unwind label [[LPAD:%.*]]
; CHECK:       cont:
; CHECK-NEXT:    ret i32 [[DIV]]
; CHECK:       lpad:
; CHECK-NEXT:    [[TMP0:%.*]] = landingpad { ptr, i32 }
; CHECK-NEXT:    cleanup
; CHECK-NEXT:    ret i32 [[REM]]
;
entry:
  invoke void @dummy() willreturn
  to label %cont unwind label %lpad

cont:
  %div = sdiv i32 %a, %b
  ret i32 %div

lpad:
  landingpad { ptr, i32 }
  cleanup
  %rem = srem i32 %a, %b
  ret i32 %rem
}

; Use this personality function so that catchpad is guaranteed to transfer.
declare void @ProcessCLRException()

define i32 @catchswitch(i32 %a, i32 %b) personality ptr @ProcessCLRException {
; CHECK-LABEL: @catchswitch(
; CHECK-NEXT:  entry:
; CHECK-NEXT:    invoke void @dummy() #[[ATTR0]]
; CHECK-NEXT:    to label [[CONT:%.*]] unwind label [[LPAD:%.*]]
; CHECK:       cont:
; CHECK-NEXT:    ret i32 0
; CHECK:       lpad:
; CHECK-NEXT:    [[CS:%.*]] = catchswitch within none [label %cp] unwind label [[LPAD_END:%.*]]
; CHECK:       cp:
; CHECK-NEXT:    [[TMP0:%.*]] = catchpad within [[CS]] []
; CHECK-NEXT:    [[DIV:%.*]] = sdiv i32 [[A:%.*]], [[B:%.*]]
; CHECK-NEXT:    ret i32 [[DIV]]
; CHECK:       lpad.end:
; CHECK-NEXT:    [[TMP1:%.*]] = cleanuppad within none []
; CHECK-NEXT:    [[REM:%.*]] = srem i32 [[A]], [[B]]
; CHECK-NEXT:    ret i32 [[REM]]
;
entry:
  invoke void @dummy() willreturn
  to label %cont unwind label %lpad

cont:
  ret i32 0

lpad:
  %cs = catchswitch within none [label %cp] unwind label %lpad.end

cp:
  catchpad within %cs []
  %div = sdiv i32 %a, %b
  ret i32 %div

lpad.end:
  cleanuppad within none []
  %rem = srem i32 %a, %b
  ret i32 %rem
}

The catchswitch exception was a bit tricky to come up with, because we need both the right personality function and no unwind to caller, otherwise we will fail transfer checks already.

llvm/lib/Transforms/Scalar/DivRemPairs.cpp
284

"extra Div" -> Mul + Sub? I don't think we'd perform an actual div.

285

nit: BasicBlock *

287

nit: No braces for single-line if.

294

Hm, this is stricter than what is needed, most exception pads are fine. It looks like we don't really have a dedicated method for this, so just checking !isa<CatchSwitchInst>(PredBB->getTerminator()) should be fine.

llvm/test/Transforms/DivRemPairs/X86/div-rem-pairs.ll
411

I don't think this tests the right case, because the udiv/urem ops are not the same. We can't have the ops be derived from a phi.

resistor updated this revision to Diff 485456.Dec 27 2022, 9:10 PM

Update based on feedback.

resistor marked 5 inline comments as done.Dec 27 2022, 9:10 PM
nikic accepted this revision.Dec 28 2022, 1:08 AM

LGTM, thanks!

llvm/lib/Transforms/Scalar/DivRemPairs.cpp
278

nit: and and

This revision was landed with ongoing or failed builds.Dec 28 2022, 10:24 AM
This revision was automatically updated to reflect the committed changes.