Phabricator

[InstCombine] Combine lshr of add -> (a + b < a)
ClosedPublic

Authored by Pierre-vh on Nov 28 2022, 6:57 AM.

Details

Summary

Tries to perform

(lshr (add (zext X), (zext Y)), K)
->  (icmp ult (add X, Y), X)
where
  - The add's operands are zexts from a K-bit integer to a bigger type.
  - The add is only used by the shr, or by iK (or narrower) truncates.
  - The lshr type has more than 2 bits (other types are boolean math).
  - K > 1
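
The equivalence the summary describes can be sanity-checked in plain C++ for one concrete instance (zext i8 to i16, K = 8; function names here are illustrative, not from the patch):

```cpp
#include <cassert>
#include <cstdint>

// Wide form matched by the combine, specialized to zext i8 -> i16, K = 8:
//   (lshr (add (zext X), (zext Y)), 8)
uint16_t wideForm(uint8_t x, uint8_t y) {
  return (uint16_t)((uint16_t)x + (uint16_t)y) >> 8;
}

// Narrow replacement: zext (icmp ult (add X, Y), X).
// The wrapped 8-bit sum is below X exactly when the add carried out.
uint16_t narrowForm(uint8_t x, uint8_t y) {
  return (uint8_t)(x + y) < x ? 1 : 0;
}

// Exhaustive check over all i8 pairs that both forms agree.
bool formsAgree() {
  for (int x = 0; x < 256; ++x)
    for (int y = 0; y < 256; ++y)
      if (wideForm((uint8_t)x, (uint8_t)y) != narrowForm((uint8_t)x, (uint8_t)y))
        return false;
  return true;
}
```

The i8 space is small enough to verify exhaustively, which is why the sketch fixes K = 8.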

This seems to be a pattern that just comes from OpenCL front-ends, so adding DAG/GISel combines doesn't seem to be worth the complexity.

Original patch D107552 by @abinavpp - adapted to use (a + b < a) instead of uaddo following discussion on the review.
See this issue https://github.com/RadeonOpenCompute/ROCm/issues/488

Diff Detail

Event Timeline

Pierre-vh updated this revision to Diff 478910.Nov 30 2022, 5:26 AM
Pierre-vh marked an inline comment as done.

Comment

lebedev.ri added inline comments.Nov 30 2022, 5:31 AM
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
853

I think it would be nicer to hoist the match and rename the variable.

Value *Add = I.getOperand(0);
Value *X = nullptr, *Y = nullptr;
if (!match(Add, m_Add(m_Value(X), m_Value(Y))))
  return nullptr;

> I don't think there is any requirement for the wider type to be exactly double the narrower type.

That's correct:
https://alive2.llvm.org/ce/z/iLVIgn

So this patch/tests are too narrow as-is. It should be checking something like "if we only demand the top N bits of an add, and the add operands are known zero in those top N bits, then fold the add into an overflow check."

Also, canonicalizing to the add intrinsic if we're not using the add part of the result seems like the wrong direction. I can't tell from the larger test what we're expecting to happen. Please pre-commit the baseline tests, so we can see the diffs.
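
The "not exactly double" point can also be checked outside alive2 with a small C++ sketch (illustrative names; here the wide type is i32, four times the i8 operand width):

```cpp
#include <cassert>
#include <cstdint>

// The wide type is i32, four times the i8 operand width, yet the
// lshr by 8 still isolates exactly the carry of the 8-bit add.
uint32_t wideForm32(uint8_t x, uint8_t y) {
  return ((uint32_t)x + (uint32_t)y) >> 8;
}

// Same (a + b < a) replacement as in the double-width case.
uint32_t overflowBit(uint8_t x, uint8_t y) {
  return (uint8_t)(x + y) < x ? 1u : 0u;
}

// Exhaustive agreement check over all i8 pairs.
bool agreeForAllBytes() {
  for (int x = 0; x < 256; ++x)
    for (int y = 0; y < 256; ++y)
      if (wideForm32((uint8_t)x, (uint8_t)y) != overflowBit((uint8_t)x, (uint8_t)y))
        return false;
  return true;
}
```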

Pierre-vh marked an inline comment as done.Nov 30 2022, 6:10 AM

> I don't think there is any requirement for the wider type to be exactly double the narrower type.
>
> That's correct:
> https://alive2.llvm.org/ce/z/iLVIgn
>
> So this patch/tests are too narrow as-is. It should be checking something like "if we only demand the top N bits of an add, and the add operands are known zero in those top N bits, then fold the add into an overflow check."
>
> Also, canonicalizing to the add intrinsic if we're not using the add part of the result seems like the wrong direction. I can't tell from the larger test what we're expecting to happen. Please pre-commit the baseline tests, so we can see the diffs.

Will update the combine & add a base test diff.

Do you mean we shouldn't do the combine if the Add has only one use?

Pierre-vh updated this revision to Diff 478932.Nov 30 2022, 6:24 AM

Since I relaxed the rules on the combine, another test changed.
Not sure if the new conditions are correct, what do you think?

> Will update the combine & add a base test diff.
>
> Do you mean we shouldn't do the combine if the Add has only one use?

I think it's the inverse - if the add has only one use, then fold to "not+icmp+zext":
https://github.com/llvm/llvm-project/issues/59232

If the add has >1 use, then I'm not sure what we want to happen. In the general form, we have something like this:
https://alive2.llvm.org/ce/z/sW5BME
...so what other pieces of the pattern need to be there to justify creating the add intrinsic? We're in target-independent InstCombine here, so we don't usually want to end up with more instructions than we started with.

> I think it's the inverse - if the add has only one use, then fold to "not+icmp+zext":
> https://github.com/llvm/llvm-project/issues/59232
>
> If the add has >1 use, then I'm not sure what we want to happen. In the general form, we have something like this:
> https://alive2.llvm.org/ce/z/sW5BME
> ...so what other pieces of the pattern need to be there to justify creating the add intrinsic? We're in target-independent InstCombine here, so we don't usually want to end up with more instructions than we started with.

I personally think we can create the add intrinsic if the add has more than one user and the users are either the a/lshr or truncs (like we check now).
In the end I'm not sure; to me it looks beneficial, but I'll leave the final decision to people with more experience (cc @foad / @arsenm, what do you think?)

There's a potentially missing/difficult optimization in the larger example. It boils down to this:
https://alive2.llvm.org/ce/z/qpCq-X

Ie, should we replace a value (the trunc) with a narrow math op that produces the identical value directly? That might be good because it removes a use of the wide add and increases parallelism, but it might be bad because it creates an independent math op that could impede analysis and be more expensive than a trunc in codegen. That's the problem shown in issue #59217 - with a mul, it's pretty clear that we don't want to create more math.

Given that there's no clear answer (and no way to invert the transform that I'm aware of), the direction of this patch is ok with me.
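
The identity behind the trunc-vs-narrow-math question above is that truncation distributes over addition (both sides compute the sum modulo 2^N). A quick sketch of that, with made-up names:

```cpp
#include <cassert>
#include <cstdint>

// A narrow trunc user of the wide add...
uint8_t truncOfWideAdd(uint16_t x, uint16_t y) {
  return (uint8_t)(x + y);
}

// ...produces the identical value to a narrow add of truncated operands,
// since both are the sum modulo 2^8.
uint8_t narrowAddOfTruncs(uint16_t x, uint16_t y) {
  return (uint8_t)((uint8_t)x + (uint8_t)y);
}

// Sample the 16-bit space with coprime strides to keep the check fast.
bool truncDistributes() {
  for (uint32_t x = 0; x < 65536; x += 257)
    for (uint32_t y = 0; y < 65536; y += 263)
      if (truncOfWideAdd((uint16_t)x, (uint16_t)y) !=
          narrowAddOfTruncs((uint16_t)x, (uint16_t)y))
        return false;
  return true;
}
```

Whether rewriting one form to the other is profitable is exactly the codegen/analysis trade-off discussed above; the identity itself is unconditional.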

llvm/test/Transforms/InstCombine/shift-add.ll
437–438

This is an awkward way to check if 2 bools are set.
We're missing the reduction of boolean math to logic either way:
https://alive2.llvm.org/ce/z/4dBQhx

648–649

For anything but the i1/i2 case, we should convert the ashr to lshr (as happened here and the next test)?
So we could just bail out of the transform if the type doesn't have at least 3 bits (ignore the possibility of ashr).

Pierre-vh updated this revision to Diff 479179.Dec 1 2022, 12:02 AM
Pierre-vh marked 2 inline comments as done.

Comments

llvm/test/Transforms/InstCombine/shift-add.ll
437–438

I removed support for types <3 bits; should I leave the test case in?

648–649

Ah I didn't know that, I'll simplify the combine to only check lshr then. Thanks!

Thanks.

llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
847–848
880

Is there test coverage when the shift amt isn't the half of the original width?

902–905

alive2 proof please?

Pierre-vh updated this revision to Diff 479537.Dec 2 2022, 12:33 AM
Pierre-vh marked 3 inline comments as done.
  • Add new tests (rebased)
  • Fix the combine to check for the EXACT number of leading zeros to ensure the transform is correct. Otherwise, if there are too few/too many leading zeros, the shift could be checking something other than the OV bit, I think.
Pierre-vh added inline comments.Dec 2 2022, 12:57 AM
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
880

lshr_16_to_64_add_zext_basic?
I also added a couple of test cases where the shift amount is lower/higher than what's desired.

902–905

At this stage, the add is only used by either ShAmt-sized truncs, or the shift.
We're removing the shift, and for the truncs, they will cancel out.
https://alive2.llvm.org/ce/z/TWQp29

lebedev.ri added inline comments.Dec 2 2022, 6:02 AM
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
869–871

What we want to know here is whether both X and Y can be NUW-truncated to ShAmt-wide types.
This is awkward because we are tiptoeing around adding new QoL helper functions.
We already have this: (ValueTracking.h/cpp)

/// Get the upper bound on bit size for this Value \p Op as a signed integer.
/// i.e.  x == sext(trunc(x to MaxSignificantBits) to bitwidth(x)).
/// Similar to the APInt::getSignificantBits function.
unsigned ComputeMaxSignificantBits(const Value *Op, const DataLayout &DL,
                                   unsigned Depth = 0,
                                   AssumptionCache *AC = nullptr,
                                   const Instruction *CxtI = nullptr,
                                   const DominatorTree *DT = nullptr);

but that is for sign bits, while we want zero bits.
Can you just add an unsigned variant of it next to it, and use it here?
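
The scalar intuition for that unsigned variant: the smallest width W such that x == zext(trunc(x to W)). The real helper would be based on known zero bits rather than a concrete value; this standalone sketch (hypothetical names, not the ValueTracking API) only illustrates the defining property:

```cpp
#include <cassert>
#include <cstdint>

// Smallest W such that x round-trips through trunc-to-W then zext,
// i.e. the position of the highest set bit (0 for x == 0).
unsigned maxActiveBits(uint64_t x) {
  unsigned w = 0;
  while (x) { ++w; x >>= 1; }
  return w;
}

// trunc x to W bits, zext back, and check nothing was lost.
bool zextTruncRoundTrips(uint64_t x, unsigned w) {
  if (w >= 64) return true;
  return (x & ((1ULL << w) - 1)) == x;
}
```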

902–905

Right, I'm being slow. This is correct.

908

I think we should just emit it all next to the original add,
there isn't much point in sinking the final zext.

Pierre-vh updated this revision to Diff 480372.Dec 6 2022, 1:04 AM
Pierre-vh marked 2 inline comments as done.

Comments

nikic added a subscriber: nikic.Dec 6 2022, 1:30 AM

FWIW our historical stance has always been that uadd.with.overflow is non-canonical, and the canonical pattern is a + b < a (for non-constant b). uadd.with.overflow generally has worse optimization support, which is why we only form it during CGP for backend purposes.
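
As a concrete aside, the canonical pattern and the intrinsic form agree on every input; a standalone check (using the Clang/GCC builtin that mirrors llvm.uadd.with.overflow):

```cpp
#include <cassert>
#include <cstdint>

// Canonical form: for unsigned a and b, the wrapped sum is smaller
// than a exactly when the addition overflowed.
bool canonicalOverflow(uint32_t a, uint32_t b) { return a + b < a; }

// Intrinsic form, via the Clang/GCC builtin analogous to
// llvm.uadd.with.overflow.i32.
bool intrinsicOverflow(uint32_t a, uint32_t b) {
  uint32_t sum;
  return __builtin_add_overflow(a, b, &sum);
}
```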

> FWIW our historical stance has always been that uadd.with.overflow is non-canonical, and the canonical pattern is a + b < a (for non-constant b). uadd.with.overflow generally has worse optimization support, which is why we only form it during CGP for backend purposes.

Interesting, not sure what other reviewers think?
Maybe adding a TII hook so targets can enable/disable the combine is a good idea? e.g. something like allowUAddoCanonicalForm?

spatel added a comment.Dec 8 2022, 5:35 AM

> Interesting, not sure what other reviewers think?
> Maybe adding a TII hook so targets can enable/disable the combine is a good idea? e.g. something like allowUAddoCanonicalForm?

We don't want TTI/TLI hooks in InstCombine because it's supposed to be early/target-independent transforms only (although it has folds gated by the data-layout that are almost the same thing as using TTI).

We decided that there is a legitimate need for target-dependent canonicalization though, so that's now possible in AggressiveInstCombine. So moving this patch to that pass and gating the transform on a target hook seems like a non-controversial way forward.

Currently, CodeGenPrepare uses this hook:
https://github.com/llvm/llvm-project/blob/1fe65d866c5285261b7766f2d3930ae975b878ff/llvm/include/llvm/CodeGen/TargetLowering.h#L3086

If we're using that as an early predicate, then I think it should be moved to TargetTransformInfo. I don't see any uses outside of CGP currently.

arsenm added a comment.Dec 8 2022, 5:36 AM

> We don't want TTI/TLI hooks in InstCombine because it's supposed to be early/target-independent transforms only (although it has folds gated by the data-layout that are almost the same thing as using TTI).
>
> We decided that there is a legitimate need for target-dependent canonicalization though, so that's now possible in AggressiveInstCombine. So moving this patch to that pass and gating the transform on a target hook seems like a non-controversial way forward.
>
> Currently, CodeGenPrepare uses this hook:
> https://github.com/llvm/llvm-project/blob/1fe65d866c5285261b7766f2d3930ae975b878ff/llvm/include/llvm/CodeGen/TargetLowering.h#L3086
>
> If we're using that as an early predicate, then I think it should be moved to TargetTransformInfo. I don't see any uses outside of CGP currently.

I'm more inclined to do this in CGP than AggressiveInstCombine if the shift is the preferred canonical form

arsenm added a comment.Dec 8 2022, 5:37 AM

> I'm more inclined to do this in CGP than AggressiveInstCombine if the shift is the preferred canonical form

Rather, if we can produce the add and compare as a more canonical form and match that in the backend, that would be better

nikic added a comment.Dec 8 2022, 5:44 AM

> Interesting, not sure what other reviewers think?
> Maybe adding a TII hook so targets can enable/disable the combine is a good idea? e.g. something like allowUAddoCanonicalForm?

I don't think there's any need for target dependence here. You just need to produce a + b < a instead of extract(uaddo(a, b), 1). The uaddo will be formed by the backend.

spatel added a comment.Dec 8 2022, 5:46 AM

> Rather, if we can produce the add and compare as a more canonical form and match that in the backend, that would be better

Agreed - if we're just creating add+icmp rather than the intrinsic, then that seems fine to do here.

By a + b < a do you mean that the combine would:

  • Still reduce the add to the smaller type
  • Replace the overflow bit (lshr) with icmp lt (add a, b), a?

Does the backend already transform that to uaddo or will that need a separate patch in the target's CGP?

arsenm added a comment.Dec 8 2022, 6:03 AM

> By a + b < a do you mean that the combine would:
>
>   • Still reduce the add to the smaller type
>   • Replace the overflow bit (lshr) with icmp lt (add a, b), a?
>
> Does the backend already transform that to uaddo or will that need a separate patch in the target's CGP?

Looks like no: https://godbolt.org/z/nTrzqxeo9

nikic added a comment.Dec 8 2022, 6:05 AM

> Looks like no: https://godbolt.org/z/nTrzqxeo9

You are using slt rather than ult.
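
The slt/ult distinction matters here; a small standalone check of why the signed compare gives the wrong answer (illustrative names, assumes two's complement):

```cpp
#include <cassert>
#include <cstdint>

// Correct unsigned-overflow check (icmp ult).
bool ultCheck(uint32_t a, uint32_t b) { return a + b < a; }

// The slt version compares as signed; it disagrees whenever the sum
// crosses the sign-bit boundary without an unsigned carry, and misses
// real carries whose wrapped sum is still "greater" as signed.
bool sltCheck(uint32_t a, uint32_t b) {
  return (int32_t)(a + b) < (int32_t)a;
}
```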

Pierre-vh updated this revision to Diff 481580.Dec 9 2022, 2:21 AM

Change to (a + b < a); don't combine if ShAmt == 1.

Seems like CGP does the conversion to uaddo; or at least in AMDGPU's case codegen is the same for uaddo/(a + b < a): https://godbolt.org/z/zfnorbvzr

We seem to have more bitwise operations now; is it expected to see the add folded into other operations (xor/and)?
Should I restrict the combine to ShAmt > 2?

Pierre-vh retitled this revision from [InstCombine] Combine a/lshr of add -> uadd.with.overflow to [InstCombine] Combine a/lshr of add -> (a + b < a).Dec 9 2022, 2:22 AM
Pierre-vh edited the summary of this revision. (Show Details)
Pierre-vh updated this revision to Diff 486186.Wed, Jan 4, 12:36 AM

Rebase + ping

nikic added inline comments.Wed, Jan 4, 2:19 AM
llvm/test/Transforms/InstCombine/pr34349.ll
17 ↗(On Diff #486186)

This case doesn't look like a profitable transform. Do I understand correctly that the actual motivating case here is the case where the inputs are zext, and then this was later generalized based on reviewer feedback to use known bits instead?

For the zext case, this looks like an obviously desirable transform, but for the general case (where the truncs may not fold away) this is less clearly beneficial. I would personally restrict this to just zext unless we have specific motivation otherwise. (But I'm also not going to fight this if reviewers disagree.)

Pierre-vh added inline comments.Thu, Jan 5, 2:34 AM
llvm/test/Transforms/InstCombine/pr34349.ll
17 ↗(On Diff #486186)

I indeed did it using known bits after a comment from @spatel:

> In the earlier version(s) of this patch (is there a functional diff here from D107552?), there were questions about how it would interact with SCEV and/or vectorization.
>
> If we're focusing on the overflow-only/minimal pattern, then I think we should structure the matching as a known-bits/demanded-bits problem. Ie, instead of zexts, we might have and masks. Instead of a shift at the end, that might also be a mask op.

For me it's fine to do it with just zext, but then I'm afraid we'd miss some profitable cases.

spatel added inline comments.Thu, Jan 5, 7:25 AM
llvm/test/Transforms/InstCombine/pr34349.ll
17 ↗(On Diff #486186)

If we can get all of the motivating cases by matching zext directly, we can do that as a first step to reduce risk. Then, we can do the more general transform if needed as a second step.

IIUC, this means we could split off the ValueTracking part of the patch to an independent patch (add unit tests if there are no callers currently).

Also, please commit the baseline tests to main as a preliminary patch (D139011), so we can see current diffs (the bool math test diffs should be eliminated after 71df24dd39177ecfc440a0 ?)

Pierre-vh updated this revision to Diff 486815.Fri, Jan 6, 4:34 AM

Rebased tests for proper diff

Before I remove the known bits part and replace it with zext matching I would like @arsenm/@foad to give their opinions too

arsenm added a comment.Fri, Jan 6, 5:12 AM

> Rebased tests for proper diff
>
> Before I remove the known bits part and replace it with zext matching I would like @arsenm/@foad to give their opinions too

I think splitting the known bits part as a second step makes sense. The general case multiplies your potential bug surface so I think it's usually better to do this in 2 steps

Pierre-vh retitled this revision from [InstCombine] Combine a/lshr of add -> (a + b < a) to [InstCombine] Combine lshr of add -> (a + b < a).Fri, Jan 6, 5:28 AM
Pierre-vh edited the summary of this revision. (Show Details)
Pierre-vh updated this revision to Diff 486836.Fri, Jan 6, 5:36 AM

Remove knownbits, use zext matching

spatel added inline comments.Fri, Jan 6, 8:42 AM
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
866

We should put m_OneUse() limits on the m_ZExt matches. Also, add a test where the zexts have extra use(s).

That should avoid the +2 instruction count regression in the next patch. We are still potentially increasing instruction count with this transform, but the trade-off seems more reasonable if we can eliminate more of the intermediate instructions.

Pierre-vh updated this revision to Diff 487358.Mon, Jan 9, 2:50 AM
Pierre-vh marked an inline comment as done.

Add m_OneUse to m_ZExt in match

spatel accepted this revision.Mon, Jan 9, 7:58 AM

LGTM

This revision is now accepted and ready to land.Mon, Jan 9, 7:58 AM
This revision was automatically updated to reflect the committed changes.
foad added a comment.Tue, Jan 10, 2:06 AM

> FWIW our historical stance has always been that uadd.with.overflow is non-canonical, and the canonical pattern is a + b < a (for non-constant b). uadd.with.overflow generally has worse optimization support, which is why we only form it during CGP for backend purposes.

What's the historical stance on sadd.with.overflow?

llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
848

Likewise I don't think you need this line and the "boolean math" comment belongs on the "K > 1" line.

860

I don't think you need this check since you check below that K != 1 and it matches the width of X and Y, and the shift type must be strictly wider than that. But I guess it's harmless.

nikic added a comment.Tue, Jan 10, 2:10 AM

> What's the historical stance on sadd.with.overflow?

sadd.with.overflow (and generally, all signed overflow intrinsics) are considered canonical, and we do produce them in InstCombine.

Pierre-vh marked 2 inline comments as done.Wed, Jan 11, 3:53 AM
Pierre-vh added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineShifts.cpp
848

Will address in D141129