This is an archive of the discontinued LLVM Phabricator instance.

[LegalizeTypes] Improve expansion of wide SMIN/SMAX/UMIN/UMAX
ClosedPublic

Authored by efriedma on May 24 2023, 11:37 AM.

Details

Summary

The current implementation tries to handle the high and low halves separately, but that's less efficient in most cases; use a wide SETCC instead.

Still some small regressions scattered across the testcases... the most concerning are AMDGPU and VE, where apparently this actually makes things worse somehow in the general case. Do we need a target hook, or is there some way to use a unified codepath?
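In C terms, the wide-SETCC expansion for, e.g., an illegal 128-bit umin can be sketched as follows (illustrative only; the struct and helper names are made up, this is not the LegalizeIntegerTypes code itself):

```c
#include <stdint.h>

// A 128-bit value split into two 64-bit halves, as the legalizer sees it.
typedef struct { uint64_t lo, hi; } u128;

// New-style expansion: one wide compare (the "wide SETCC"), then select
// both result halves on that single condition.
static u128 umin128(u128 a, u128 b) {
    // Wide unsigned a < b: the high halves decide unless they are equal.
    int lt = (a.hi != b.hi) ? (a.hi < b.hi) : (a.lo < b.lo);
    u128 r = { lt ? a.lo : b.lo, lt ? a.hi : b.hi };
    return r;
}
```

The previous expansion instead computed a min on each half plus a select_cc, i.e. an extra min-type node per half.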

Diff Detail

Event Timeline

efriedma created this revision. May 24 2023, 11:37 AM
Herald added a project: Restricted Project. May 24 2023, 11:37 AM
Herald added subscribers: luke, pmatos, asb and 32 others.
efriedma requested review of this revision. May 24 2023, 11:37 AM
arsenm added inline comments. May 24 2023, 11:46 AM
llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
2955–2960

Should do the RHS check first

llvm/test/CodeGen/AMDGPU/min.ll
568–573

I'm guessing you need to check if the wide setcc is legal/good

craig.topper added inline comments. May 24 2023, 11:48 AM
llvm/test/CodeGen/RISCV/fpclamptosat.ll
1953

I wonder why this slti+xori+beqz didn't simplify.

efriedma added inline comments. May 24 2023, 11:53 AM
llvm/test/CodeGen/AMDGPU/min.ll
552–557

Looked more carefully at the output; I don't think this is actually a regression, the total instruction count appears to be the same.

Before:

  LSHR * T0.X, KC0[2].Y, literal.x,
2(2.802597e-45), 0(0.000000e+00)
  SETGE_UINT * T0.W, KC0[3].X, KC0[3].Z,
  CNDE_INT   T0.Z, PV.W, KC0[2].W, KC0[3].Y,
  MIN_UINT * T0.W, KC0[2].W, KC0[3].Y,
  SETE_INT * T1.W, KC0[3].X, KC0[3].Z,
  CNDE_INT   T1.X, PV.W, T0.Z, T0.W,
  MIN_UINT * T1.Y, KC0[3].X, KC0[3].Z,

After:

  SETE_INT   T0.Z, KC0[3].X, KC0[3].Z,
  SETGT_UINT * T0.W, KC0[3].Z, KC0[3].X,
  SETGT_UINT * T1.W, KC0[3].Y, KC0[2].W,
  CNDE_INT * T0.W, T0.Z, T0.W, PV.W,
  CNDE_INT * T0.Y, PV.W, KC0[3].Z, KC0[3].X,
  CNDE_INT * T0.X, T0.W, KC0[3].Y, KC0[2].W,
  LSHR * T1.X, KC0[2].Y, literal.x,
2(2.802597e-45), 0(0.000000e+00)
llvm/test/CodeGen/AMDGPU/r600-legalize-umax-bug.ll
23 (On Diff #525276)

Regression.

llvm/test/CodeGen/ARM/fpclamptosat.ll
2204

Regression. Might need to special-case the constant 4294967295.

craig.topper added inline comments. May 24 2023, 12:00 PM
llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
3023–3057

I think this should be (a <= b ? a : b) for some constants. Like @llvm.smin.i64(i64 %conv, i64 2147483647)
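For illustration, here is how a wide-compare expansion of @llvm.smin.i64(%conv, i64 2147483647) might look on a 32-bit target once the constant's halves (hi = 0, lo = 0x7fffffff) are folded in. This is a hand-written sketch, not the patch's actual output:

```c
#include <stdint.h>

// Sketch: smin(x, 2147483647) on a 32-bit target, with x split into
// halves. The constant's high half is 0 and its low half is 0x7fffffff.
static int64_t smin_clamp_i32max(int32_t xhi, uint32_t xlo) {
    // Wide signed compare x < 0x000000007fffffff, split into halves.
    int lt = xhi < 0 || (xhi == 0 && xlo < UINT32_C(0x7fffffff));
    int32_t rhi = lt ? xhi : 0;
    uint32_t rlo = lt ? xlo : UINT32_C(0x7fffffff);
    // Reassemble the two halves (unsigned math to avoid shifting a
    // negative value).
    return (int64_t)(((uint64_t)(uint32_t)rhi << 32) | rlo);
}
```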

efriedma added inline comments. May 24 2023, 12:01 PM
llvm/test/CodeGen/AMDGPU/min.ll
568–573

Determining "good" is tricky... RISCV doesn't have any special instructions for wide setcc, but this change is still profitable there.

efriedma added inline comments. May 24 2023, 1:06 PM
llvm/test/CodeGen/RISCV/fpclamptosat.ll
1953

The sequence doesn't actually exist in this form until very late, in block-placement. In SelectionDAG, we just have a RISCVISD::SELECT_CC used as the condition of another RISCVISD::SELECT_CC, which obscures the sequence. (For similar reasons, we have sltu+bnez instead of bltu.)

Not sure what the best solution looks like. I guess for 64-bit compares in particular, you could define a pseudo-instruction SELECTCC_64BIT or something like that. Maybe more generally, it would help if we transform select(select(a,b,c),d,e) to select(a, select(b,d,e), select(c,d,e)).
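The suggested rewrite, expressed in C with boolean operands for clarity (an illustration of the DAG transform, not code from the patch):

```c
// select(select(a, b, c), d, e): the inner select sits in the
// condition position, producing a chained pair of triangles.
static int select_nested(int a, int b, int c, int d, int e) {
    return (a ? b : c) ? d : e;
}

// select(a, select(b, d, e), select(c, d, e)): the condition is now a
// plain value, and the two inner selects are independent.
static int select_distributed(int a, int b, int c, int d, int e) {
    return a ? (b ? d : e) : (c ? d : e);
}
```

The two forms are equivalent for boolean a, b, c; the second exposes structure that later select-to-branch lowering can exploit.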

efriedma added inline comments. May 24 2023, 1:18 PM
llvm/test/CodeGen/RISCV/min-max.ll
80

It's easier to see the reason we end up with slt+beqz/sltu+bnez/etc. here. Immediately after isel, we have one triangle followed by another triangle. Later folding mashes the code together. You can see the original form if you pass this function to "llc -opt-bisect-limit=0":

smax_i64:                               # @smax_i64
        .cfi_startproc
# %bb.0:
        slt     a5, a3, a1
        sltu    a4, a2, a0
        beq     a1, a3, .LBB0_2
# %bb.1:
        mv      a4, a5
.LBB0_2:
        bnez    a4, .LBB0_4
# %bb.3:
        mv      a0, a2
        mv      a1, a3
.LBB0_4:
        ret
efriedma updated this revision to Diff 525368. May 24 2023, 4:47 PM

Added a bunch of special cases.

arsenm added inline comments. Jun 5 2023, 11:11 AM
llvm/test/CodeGen/AMDGPU/max.ll
291–296

Even if there were a regression, this is all r600-only, which nobody cares about enough to worry about

The x86 changes look OK - I suppose it's just the VE regressions that need further investigation?

As far as I can tell, the issue for VE is just that an i128 icmp is lowered in an extremely inefficient way. There's no add-with-carry, and no custom lowering, so we end up using the generic expansion. And the generic expansion is expensive because setcc is relatively expensive. Looking at the instruction set, though, it should be possible to compute the flags for a 128-bit compare in three instructions: cmpu+cmpu+cmov; if that were implemented, the new expansion would actually be an improvement.

Given that, I think it makes sense to just move forward with this as-is. The regression will be automatically fixed when someone adds appropriate custom lowering for 128-bit math on VE. It would be nice to get feedback from someone more familiar with the VE instruction set, though.
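The cmpu+cmpu+cmov idea can be sketched in C like this (an illustrative stand-in: on VE, a single cmpu result encodes both ordering and equality, so the separate equality test below would fold into the cmov condition):

```c
#include <stdint.h>

// Unsigned i128 "a < b" built from per-half compares plus one
// conditional move, instead of the generic carry-chain expansion.
static int ult_i128(uint64_t ahi, uint64_t alo, uint64_t bhi, uint64_t blo) {
    int lo_lt = alo < blo;             // cmpu on the low halves
    int hi_lt = ahi < bhi;             // cmpu on the high halves
    // cmov: when the high halves are equal, the low-half result decides.
    return (ahi == bhi) ? lo_lt : hi_lt;
}
```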

RKSimon accepted this revision. Jun 21 2023, 2:10 AM

In which case, if you raise a VE backend issue for better i128 icmp codegen then I don't see why this can't be committed.

This revision is now accepted and ready to land. Jun 21 2023, 2:10 AM
This revision was landed with ongoing or failed builds. Jun 26 2023, 10:46 AM
This revision was automatically updated to reflect the committed changes.
bjope added a subscriber: bjope. Jun 27 2023, 3:18 AM

As far as I can tell, the issue for VE is just that an i128 icmp is lowered in an extremely inefficient way. There's no add-with-carry, and no custom lowering, so we end up using the generic expansion. And the generic expansion is expensive because setcc is relatively expensive. Looking at the instruction set, though, it should be possible to compute the flags for a 128-bit compare in three instructions: cmpu+cmpu+cmov; if that were implemented, the new expansion would actually be an improvement.

Given that, I think it makes sense to just move forward with this as-is. The regression will be automatically fixed when someone adds appropriate custom lowering for 128-bit math on VE. It would be nice to get feedback from someone more familiar with the VE instruction set, though.

I think we've got the same (or similar) problems downstream.

For a smaller example (like the one that you added in https://github.com/llvm/llvm-project/issues/63435) we get this in the DAG after type legalization + combine:

     t36: i16 = setcc t9, t13, setugt:ch
     t37: i16 = setcc t7, t11, setugt:ch
   t43: i16 = select_cc t7, t11, t36, t37, seteq:ch
t35: i32 = select_cc t43, Constant:i16<0>, t4, t2, setne:ch

which basically lowers to cmpu, condmove, cmpu, condmove, cmps, condmove, cmps, condmove. And each condmove ends up as two moves in the final asm, so that pattern is currently quite expensive.
Not sure exactly how we can improve that (we have a single flag register, and I think doing conditional compares in the DAG is a bit tricky). So those kinds of constructs are a bit messy and costly to handle.

The old expansion of min/max, where we got

X: i32 = smax t3, t7
  t39: i32 = umax t5, t9
  t42: i32 = select_cc t3, t7, t5, t9, setgt:ch
Y: i32 = select_cc t3, t7, t39, t42, seteq:ch

instead of

    t41: i16 = setcc t5, t9, setugt:ch
    t42: i16 = setcc t3, t7, setgt:ch
  t49: i16 = select_cc t3, t7, t41, t42, seteq:ch
X: i32 = select t49, t3, t7
Y: i32 = select t49, t5, t9

seemed better, considering that smax and umax are legal for our target (while the setcc+setcc+select_cc pattern is expensive). Also, the condmoves that we get from the DAG with smax/umax just select between the existing register values. The condmoves that operate on boolean contents (originating from the setcc pattern) also involve materializing the boolean constants in registers...

I guess I can try to add some custom legalization to get the old result downstream. Or maybe a reverse DAG combine (trying to detect that X is simply an smax formed from that select+select_cc+setcc+setcc pattern).
Or maybe there should be a target lowering hook to let targets override this and use the old expansion instead of the new part that introduces the SETCC with non-legal types when expanding min/max?

llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
2994

It is a bit unclear to me if this just adds a heuristic ("if it appears profitable"), or if something inside the if actually depends on these conditions holding.

To be more specific, if I still want to expand like this for a downstream target, can I just change it to

// Heuristic to decide if it is profitable to use the new expansion.
bool IsProfitable =
    RHSVal && (N->getOpcode() == ISD::UMIN || N->getOpcode() == ISD::UMAX) &&
    (RHSVal->countLeadingOnes() >= NumHalfBits ||
     RHSVal->countLeadingZeros() >= NumHalfBits);
if (IsProfitable || TLI->avoidSetccInMinMaxExpand()) {
  ...
}

I think any of the following makes the new expansion cheaper:

  • A subtract-with-borrow/carry instruction
  • A cheap setcc/select
  • Some cheap way to do a "select" on flags.
  • Branches are reasonably cheap, so you can expand the select-of-select construct to branches.

If your target has none of those, a target lowering hook is reasonable. Seems a little surprising to me, though; all targets I know of with a dedicated "flags" register have a subtract-with-borrow.

llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
2994

The expansion is fully general; you could use it for anything. The guard is just checking for profitability (whether the upper umin/umax will simplify).

bjope added a comment. Jun 28 2023, 6:47 AM

I think any of the following makes the new expansion cheaper:

  • A subtract-with-borrow/carry instruction
  • A cheap setcc/select
  • Some cheap way to do a "select" on flags.
  • Branches are reasonably cheap, so you can expand the select-of-select construct to branches.

If your target has none of those, a target lowering hook is reasonable. Seems a little surprising to me, though; all targets I know of with a dedicated "flags" register have a subtract-with-borrow.

I looked a bit at what happens for other targets, and I see that the SETCC often ends up being expanded into USUBO+SETCCCARRY. Maybe I could add some support for those DAG nodes, but I doubt that the end result would be good for the min/max cases anyway.

A nice lowering for a VLIW target (with min/max machine instructions, predication, etc.) would be something like this for unsigned max:

{
  umax x.hi, y.hi => res.hi          ; res.hi
  umax x.lo, y.lo => res.lo          ; res.lo when x.hi==y.hi
  compare x.hi, y.hi => CC
}
{
  condmove x.lo => res.lo  if CC.ugt     ; res.lo when x.hi "won"
  condmove y.lo => res.lo  if CC.ult     ; res.lo when y.hi "won"
}

So that would be just two bundles, no branches and only a single compare to update CC.
It also matches quite well with the old expansion at the SelectionDAG level.

I figure I would need to find a way to map those setcc/selectcc operations back to min/max to get something nearly as good as that (at least when limited to a single CC flag register).

llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
2994

Thanks for confirming.

That makes it easy for us to add our target to the profitability check to use the old expansion in more cases.

I'll handle that downstream for now, since I don't know about any in-tree target that I could use in a test case to show the need for a target lowering hook.