This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Extend canonicalizeClampLike to handle truncated inputs
ClosedPublic

Authored by dmgreen on Aug 13 2021, 11:59 AM.

Details

Summary

This extends the canonicalizeClampLike function to allow cases where the input is truncated, but still matching on the types of the ICmps. For example

%t = trunc i32 %X to i8
%a = add i32 %X, 128
%cmp = icmp ult i32 %a, 256
%c = icmp sgt i32 %X, -1
%f = select i1 %c, i8 High, i8 Low
%r = select i1 %cmp, i8 %t, i8 %f

becomes

%c1 = icmp slt i32 %X, -128
%c2 = icmp sge i32 %X, 128
%s1 = select i1 %c1, i32 sext(Low), i32 %X
%s2 = select i1 %c2, i32 sext(High), i32 %s1
%t = trunc i32 %s2 to i8

https://alive2.llvm.org/ce/z/vPzfxH

We limit the transform to constant High and Low values, where we know the sext are free.

Diff Detail

Event Timeline

dmgreen created this revision.Aug 13 2021, 11:59 AM
dmgreen requested review of this revision.Aug 13 2021, 11:59 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 13 2021, 11:59 AM

vector test cases?

llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
922

If you're going to keep this in code, please can you simplify it (no entry: + better var names).

934

How is BW different to Ty->getScalarSizeInBits() ?

dmgreen updated this revision to Diff 367156.Aug 18 2021, 2:48 AM

Thanks for taking a look. This does some cleanup as suggested, and renames the method to be consistent with others nearby.

There is a vector test in testv4i16i8. Is that what you had in mind, or would something more be preferable?

dmgreen updated this revision to Diff 368040.Aug 23 2021, 12:09 AM
dmgreen edited the summary of this revision. (Show Details)

More cleanup

dmgreen updated this revision to Diff 369500.Aug 30 2021, 11:25 AM

Rebase and ping. Any opinions on here vs aggressive instrcombine?

That's a big fold!
The larger the pattern match, the more fragile the optimization tends to be because we might eventually find sub-patterns that can be reduced.
Does it make things harder or easier if we fold that icmp? We're checking if the top N/2 + 1 bits are all set or clear, so I visualized it like this:

define i1 @src(i16 %x) {
  %t0 = lshr i16 %x, 8
  %conv.i = trunc i16 %t0 to i8
  %conv1.i = trunc i16 %x to i8
  %shr2.i = ashr i8 %conv1.i, 7
  %r = icmp eq i8 %shr2.i, %conv.i
  ret i1 %r
}

define i1 @tgt(i16 %x) {
  %mask = ashr i16 %x, 7
  %ones = icmp eq i16 %mask, -1
  %zero = icmp eq i16 %mask, 0
  %r = or i1 %ones, %zero 
  ret i1 %r
}

https://alive2.llvm.org/ce/z/reQjDv

But existing combines get that down to just 2 instructions:

define i1 @tgt(i16 %x) {
  %x.off = add i16 %x, 65408
  %r = icmp ugt i16 %x.off, 65279
  ret i1 %r
}

https://alive2.llvm.org/ce/z/8Fh23s

I don't know exactly what the generalization for this will be, but it seems like we should try that first?

That's a big fold!
The larger the pattern match, the more fragile the optimization tends to be because we might eventually find sub-patterns that can be reduced.

Yeah OK, Sounds good. I'll try it as a number of smaller folds, like you say it should be more reliable and that way the final pattern should at least be smaller. I half remember from a very long time ago trying the same thing (including extends/add to make a sadd.sat pattern), but couldn't do it without some combines that individually increased instruction count (or large combines).

I can give it another go though. From looking at it again, this gets us there, but the last fold involves moving truncates in the wrong direction:
https://godbolt.org/z/dKjdfhe1s

I'll try and figure out if there's anything more sensible to do as that last fold. Suggestions welcome if you have any :)

The problem with smaller intermediate folds is that the larger final fold
may still be wanted iff the intermediate folds may be blocked due to the use counts..

llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
925–926

I'm not sure bitwidth cutoff should be here?
It should be easy to legalize in backend, in fact it already has to.

The problem with smaller intermediate folds is that the larger final fold
may still be wanted iff the intermediate folds may be blocked due to the use counts..

Yes - that reminds me of the min/max pixel patterns from cmyk benchmarks. We ended up needing to match a larger-than-normal pattern to get those because of uses. (I'm trying to deal with the intrinsic versions of those patterns now to help D98152.)
So it depends on the motivating case here - if there are enough extra uses that we can't get the big pattern, then we might as well add this. If not, we can go for smaller matches and confirm that we get the sequence to work on the larger example(s).

dmgreen updated this revision to Diff 370557.Sep 3 2021, 6:07 AM
dmgreen retitled this revision from [InstCombine] Canonicalize saturate with shift and xor to min/max clamp to [InstCombine] Extend canonicalizeClampLike to handle truncated inputs.
dmgreen edited the summary of this revision. (Show Details)

Change to now extend canonicalizeClampLike to handle truncated inputs, which further gets folded to a saturating min/max pattern. With the other combines added in D109151 and D109155 the original case becomes a sadd.sat.

dmgreen updated this revision to Diff 377621.Oct 6 2021, 11:36 AM

Rebase onto main, making sure this has some testing independent of other patches.

lebedev.ri added inline comments.Oct 15 2021, 7:19 AM
llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
1441

We didn't need sext originally. Why do we always need one now? Should this use CreateSExtOrBitCast()?

1445–1452

I would recommend something along the lines of using CreateSExtOrBitCast(),
only calling CreateSelect in a single place, and returning through CreateTruncOrBitCast().

dmgreen updated this revision to Diff 380014.Oct 15 2021, 8:33 AM

Rejig.

llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
1441

It is relying on the CreateSExt doing if (V->getType() == DestTy) return V;. I can change that to CreateSExtOrBitCast but It shouldn't ever bitcast anything, the types will always be integers.

1445–1452

Oh yeah I was awkwardly working around it returning an intruction. I've changed it, let be know if it looks wrong or they should be changed to OrBitcast versions.

I notice that all the changed tests perform signed clamping,
and likewise, you sext. Is sext always the right choice?
Please add at least one test with unsigned clamp,
and an alive proof.

I notice that all the changed tests perform signed clamping,
and likewise, you sext. Is sext always the right choice?
Please add at least one test with unsigned clamp,
and an alive proof.

This method wont performs unsigned clamping (depending on what unsigned clamping means). It transforms code of the form:
select (icmp ult X, C0), add(X, C1), select(icmp slt(X, C2), L, H))
And always produces signed clamp outputs. So there is an unsigned comparison in there (and a signed one), but they are not interchangable.

I can change either the first icmp to signed, or the second to unsigned, but then the method wont match them and this patch doesn't alter the codegen. I'll add them to the list of tests, but the codegen here won't change

define i16 @testi32i16i8_s(i32 %add) {
; CHECK-LABEL: @testi32i16i8_s(
; CHECK-NEXT:    [[A:%.*]] = add i32 [[ADD:%.*]], 128
; CHECK-NEXT:    [[CMP:%.*]] = icmp slt i32 [[A]], 256
; CHECK-NEXT:    [[T:%.*]] = trunc i32 [[ADD]] to i16
; CHECK-NEXT:    [[C:%.*]] = icmp sgt i32 [[ADD]], -1
; CHECK-NEXT:    [[F:%.*]] = select i1 [[C]], i16 127, i16 -128
; CHECK-NEXT:    [[R:%.*]] = select i1 [[CMP]], i16 [[T]], i16 [[F]]
; CHECK-NEXT:    ret i16 [[R]]
;
  %a = add i32 %add, 128
  %cmp = icmp slt i32 %a, 256
  %t = trunc i32 %add to i16
  %c = icmp sgt i32 %add, -1
  %f = select i1 %c, i16 127, i16 -128
  %r = select i1 %cmp, i16 %t, i16 %f
  ret i16 %r
}

define i16 @testi32i16i8_u(i32 %add) {
; CHECK-LABEL: @testi32i16i8_u(
; CHECK-NEXT:    [[A:%.*]] = add i32 [[ADD:%.*]], 128
; CHECK-NEXT:    [[CMP:%.*]] = icmp ult i32 [[A]], 256
; CHECK-NEXT:    [[T:%.*]] = trunc i32 [[ADD]] to i16
; CHECK-NEXT:    [[R:%.*]] = select i1 [[CMP]], i16 [[T]], i16 -128
; CHECK-NEXT:    ret i16 [[R]]
;
  %a = add i32 %add, 128
  %cmp = icmp ult i32 %a, 256
  %t = trunc i32 %add to i16
  %c = icmp ugt i32 %add, -1
  %f = select i1 %c, i16 127, i16 -128
  %r = select i1 %cmp, i16 %t, i16 %f
  ret i16 %r
}

This is a proof that includes the sexts: https://alive2.llvm.org/ce/z/Y_Q3yQ

Uhm, okay. Then, could you please post the generalized (with variables) alive2 proof for ULT+SLT predicate pair?

Uhm, okay. Then, could you please post the generalized (with variables) alive2 proof for ULT+SLT predicate pair?

Hmm, What do you mean? The proof above is ult+slt. Do you mean in some other way?

Uhm, okay. Then, could you please post the generalized (with variables) alive2 proof for ULT+SLT predicate pair?

Hmm, What do you mean? The proof above is ult+slt. Do you mean in some other way?

I mean it is for a hardcoded set of constants, while i'd like to see it's general form.

Here's what i wanted to see: https://alive2.llvm.org/ce/z/2XpHkB
Did i mess that proof up, or are some new preconditions needed?

Here's what i wanted to see: https://alive2.llvm.org/ce/z/2XpHkB
Did i mess that proof up, or are some new preconditions needed?

Oh I see. It appears to need a check that the C0 in icmp ult %a, C0 isn't zero, to prevent undef being an issue, unless I got this wrong:
https://alive2.llvm.org/ce/z/Fxjsv6
Same thing for the original case:
https://alive2.llvm.org/ce/z/BBdqhZ

It sounds fine to me to assume that the icmp ult %a, 0 will have already been simplified away, but let me know if you think an extra check should be added in case.

Aha, so the initial transformation is already a miscompile:
https://alive2.llvm.org/ce/z/AsQQBJ

Could you please fix it first?
You need to be freezing the %x: https://alive2.llvm.org/ce/z/ekNz8E

I don't think we need freeze for the general case, just that C0 != 0 (which will pretty much always be true as icmp ult i32 %t2, 0 isn't a very useful thing to check :) )
https://alive2.llvm.org/ce/z/VNnGSy

I'll add an explicit check for it though.

dmgreen updated this revision to Diff 380162.Oct 16 2021, 3:19 AM

Added a check for C0 being zero. This is the proof for the UGT part:
https://alive2.llvm.org/ce/z/tr8drB

I'm not sure if this case is possible to test. The icmp will usually have been simplified away by the time we visit this pattern.

lebedev.ri added inline comments.Oct 21 2021, 4:55 AM
llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
1310–1324

What happens if one element of a constant vector is all-ones/zero?

dmgreen updated this revision to Diff 381243.Oct 21 2021, 6:39 AM

The constants aren't splats! Switch back to using the m_SpecificInt_ICMP method for checking constant elements.

Any further comments?

Sorry, this should not be taking so long.

llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
1350–1351

match(&X, m_TruncOrSelf(m_Value(X))); ?

1426

This accepts constant exprs too. You probably want m_ImmConstant() matcher.

1448

Hm, so this now creates one more instruction than before,
but i'm not seeing one more one-use check being added.
Perhaps the original trunc should be one-use?

dmgreen updated this revision to Diff 382985.Oct 28 2021, 4:32 AM

Use m_ImmConstant, match(X, m_TruncOrSelf(m_Value(X))) (I hope that's OK) and add a one use check for the truncate.

lebedev.ri accepted this revision.Oct 28 2021, 4:35 AM

LG to me, unless @spatel has further thoughts.

This revision is now accepted and ready to land.Oct 28 2021, 4:35 AM
spatel added inline comments.Oct 28 2021, 6:20 AM
llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
1425

It would be clearer to follow the optional casting logic if we consolidate it one block with something like this:

if (X->getType() != Sel0.getType()) {
  Constant *LowC, *HighC;
  if (!match(ReplacementLow, m_ImmConstant(LowC)) ||
      !match(ReplacementHigh, m_ImmConstant(HighC)))
    return nullptr;
  ReplacementLow = ConstantExpr::getSExt(LowC, X->getType());
  ReplacementHigh = ConstantExpr::getSExt(HighC, X->getType());
}

...and then get rid of the CreateSext diffs below here?

llvm/test/Transforms/InstCombine/truncating-saturate.ll
665

Double-checking my understanding: this is a miscompile and is fixed by the new ULT check (m_SpecificInt_ICMP)? If so, let's commit that fix first?

dmgreen updated this revision to Diff 383025.Oct 28 2021, 7:07 AM

Thanks - updated as per comments along with rG79011c705b58 for the initial part.

lebedev.ri accepted this revision.Oct 28 2021, 7:10 AM
spatel accepted this revision.Oct 28 2021, 7:31 AM

LGTM

This revision was landed with ongoing or failed builds.Oct 28 2021, 7:47 AM
This revision was automatically updated to reflect the committed changes.

this is causing crashes, e.g.

$ cat reduced.ll 
define i8 @f(i32 %value, i8 %call.i) {
entry:
  %cmp.i = icmp slt i32 %value, 0
  %cond.i = select i1 %cmp.i, i8 %call.i, i8 0
  %cmp.i.i = icmp ult i32 %value, 256
  %conv4 = trunc i32 %value to i8
  %cond = select i1 %cmp.i.i, i8 %conv4, i8 %cond.i
  ret i8 %cond
}
$ ./build/rel/bin/opt -passes=instcombine -disable-output reduced.ll
opt: ../../llvm/lib/IR/Constants.cpp:2304: static llvm::Constant *llvm::ConstantExpr::get(unsigned int, llvm::Constant *, llvm::Constant *, unsigned int, llvm::Type *): Assertion `C1->getType() == C2->getType() && "Operand types in binary constant expression should match"' failed.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.      Program arguments: ./build/rel/bin/opt -passes=instcombine -disable-output reduced.ll
 #0 0x0000000001f1a023 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /usr/local/google/home/aeubanks/repos/llvm-project/build/rel/../../llvm/lib/Support/Unix/Signals.inc:565:13
 #1 0x0000000001f17e9e llvm::sys::RunSignalHandlers() /usr/local/google/home/aeubanks/repos/llvm-project/build/rel/../../llvm/lib/Support/Signals.cpp:98:18
 #2 0x0000000001f1a38f SignalHandler(int) /usr/local/google/home/aeubanks/repos/llvm-project/build/rel/../../llvm/lib/Support/Unix/Signals.inc:407:1
 #3 0x00007f45c95198e0 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x138e0)
 #4 0x00007f45c8ff3e71 raise ./signal/../sysdeps/unix/sysv/linux/raise.c:50:1
 #5 0x00007f45c8fdd536 abort ./stdlib/abort.c:81:7
 #6 0x00007f45c8fdd41f get_sysdep_segment_value ./intl/loadmsgcat.c:509:8
 #7 0x00007f45c8fdd41f _nl_load_domain ./intl/loadmsgcat.c:970:34
 #8 0x00007f45c8fec7f2 (/lib/x86_64-linux-gnu/libc.so.6+0x357f2)
 #9 0x0000000001ba8eed llvm::ConstantExpr::get(unsigned int, llvm::Constant*, llvm::Constant*, unsigned int, llvm::Type*) /usr/local/google/home/aeubanks/repos/llvm-project/build/rel/../../llvm/lib/IR/Constants.cpp:0:0
#10 0x0000000002929f7e canonicalizeClampLike /usr/local/google/home/aeubanks/repos/llvm-project/build/rel/../../llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp:1409:29
#11 0x0000000002929f7e llvm::InstCombinerImpl::foldSelectInstWithICmp(llvm::SelectInst&, llvm::ICmpInst*) /usr/local/google/home/aeubanks/repos/llvm-project/build/rel/../../llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp:1535:18
#12 0x000000000292f3b2 llvm::InstCombinerImpl::visitSelectInst(llvm::SelectInst&) /usr/local/google/home/aeubanks/repos/llvm-project/build/rel/../../llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp:2994:31

A revert is causing llvm/test/Transforms/InstCombine/truncating-saturate.ll to fail, could you fix forward or perhaps revert whatever stack of patches needs to be reverted?

A revert is causing llvm/test/Transforms/InstCombine/truncating-saturate.ll to fail, could you fix forward or perhaps revert whatever stack of patches needs to be reverted?

Thanks for the report. I'll fix it now.