
[InstCombine] convert mask and shift of power-of-2 to cmp+select
ClosedPublic

Authored by spatel on Jun 14 2022, 2:45 PM.

Details

Summary

When the mask is a power-of-2 constant and op0 is a shifted-power-of-2 constant, test if the shift amount equals the offset bit index:

(ShiftC << X) & C --> X == (log2(C) - log2(ShiftC)) ? C : 0
(ShiftC >> X) & C --> X == (log2(ShiftC) - log2(C)) ? C : 0
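
As a concrete instance (a minimal standalone C++ sketch, not code from the patch; srcFn/tgtFn are made-up names), take ShiftC = 2 and C = 64 at 32 bits, so the compare constant is log2(64) - log2(2) = 5:

#include <cassert>
#include <cstdint>

uint32_t srcFn(uint32_t X) { return (2u << X) & 64u; }   // (ShiftC << X) & C
uint32_t tgtFn(uint32_t X) { return X == 5 ? 64u : 0u; } // X == 5 ? C : 0

int main() {
  // X < 32 keeps the shift well-defined in C++ (and non-poison in IR).
  for (uint32_t X = 0; X < 32; ++X)
    assert(srcFn(X) == tgtFn(X));
}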

This is an alternative to D127610 with a more general pattern. We match only shift+and instead of also matching the trailing xor, so we see a few more test diffs. I think we discussed this initially in D126617.

Here are proofs for shifts in both directions:
https://alive2.llvm.org/ce/z/CFrLs4

The test diffs look equal or better for IR, and this makes the patterns more uniform in IR. The backend would likely invert this in both cases if that is profitable, and there's already code in place to do that.

Diff Detail

Event Timeline

spatel created this revision. Jun 14 2022, 2:45 PM
Herald added a project: Restricted Project. Jun 14 2022, 2:45 PM
spatel requested review of this revision. Jun 14 2022, 2:45 PM

I agree that the transform is more general. My question is how to invert the transform in the backend if shift + and is faster.
For now, SelectionDAG has this transform: select Cond, Pow2, 0 --> (zext Cond) << log2(Pow2)

// select Cond, Pow2, 0 --> (zext Cond) << log2(Pow2)
if (C1Val.isPowerOf2() && C2Val.isZero()) {
  if (VT != MVT::i1)
    Cond = DAG.getNode(ISD::ZERO_EXTEND, DL, VT, Cond);
  SDValue ShAmtC =
      DAG.getShiftAmountConstant(C1Val.exactLogBase2(), VT, DL);
  return DAG.getNode(ISD::SHL, DL, VT, Cond, ShAmtC);
}
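
For reference, here is a scalar model of that existing fold (illustrative only, not the DAG code): select(Cond, Pow2, 0) and zext(Cond) << log2(Pow2) agree for every Cond and every power of two.

#include <cassert>
#include <cstdint>

int main() {
  for (uint32_t cond = 0; cond <= 1; ++cond)
    for (unsigned lg = 0; lg < 32; ++lg) {
      uint32_t pow2 = 1u << lg;       // Pow2 = 2^lg
      uint32_t sel = cond ? pow2 : 0; // select Cond, Pow2, 0
      uint32_t shl = cond << lg;      // (zext Cond) << log2(Pow2)
      assert(sel == shl);
    }
}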

And if we want to do the reverse:
X == (log2(C) - log2(ShiftC)) ? C : 0 --> (ShiftC << X) & C
X == (log2(ShiftC) - log2(C)) ? C : 0 --> (ShiftC >> X) & C
we need to assume X < BitWidth: https://alive2.llvm.org/ce/z/-GjVsu
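
To make the hazard concrete, here is a small C++ sketch (the function names are mine, purely illustrative): the select form is defined for every X, while the shift form is undefined in C++ (poison in IR) once X reaches the bit width.

#include <cstdint>

// Defined for all X: an out-of-range X simply fails the compare.
uint32_t selectForm(uint32_t X) { return X == 5 ? 64u : 0u; }
// Undefined behavior in C++ (poison in IR) when X >= 32.
uint32_t shiftForm(uint32_t X) { return (2u << X) & 64u; }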

We need to assume X < BitWidth: https://alive2.llvm.org/ce/z/-GjVsu

Ah, yes, good point. The expansion back to the shift form requires an extra safety check for an over-shift in IR. Here's another version of the proof to show that:
https://alive2.llvm.org/ce/z/E_KCpi

And this is a hard-coded example of what the backend does currently when it creates a shift:
https://alive2.llvm.org/ce/z/6EfMnj

But in the backend, the target can specify the exact behavior for an over-shift - it does not have to be undefined (create poison). If a target returns 0 for an over-shift, then no masking or extra cmp+select may be needed.
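
As a sketch of those two target choices (shlMasked/shlZero are hypothetical names, modeling x86-style amount masking and a zero-on-over-shift target respectively): only the zeroing semantics matches the select form for all X.

#include <cassert>
#include <cstdint>

// x86-style: hardware masks a 32-bit shift amount to its low 5 bits.
uint32_t shlMasked(uint32_t V, uint32_t X) { return V << (X & 31); }
// Hypothetical target that defines an over-shift to produce 0.
uint32_t shlZero(uint32_t V, uint32_t X) { return X < 32 ? V << X : 0; }

uint32_t selectForm(uint32_t X) { return X == 5 ? 64u : 0u; }

int main() {
  // Zero-on-over-shift agrees with the select form everywhere...
  for (uint32_t X = 0; X < 64; ++X)
    assert((shlZero(2, X) & 64u) == selectForm(X));
  // ...but masking diverges: X = 37 masks to 5, yielding 64 instead of 0.
  assert((shlMasked(2, 37) & 64u) == 64u && selectForm(37) == 0u);
}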

Do you have an example for a target where the code looks worse with the cmp+select IR?

I tried some tests with x86 and AArch64, and they do not look worse to me with the cmp+select IR. Here is an example:

define i32 @src(i32 %x) {
  %shl = shl i32 2, %x
  %r = and i32 %shl, 64
  ret i32 %r
}

define i32 @tgt(i32 %x) {
  %i = icmp eq i32 %x, 5
  %r = select i1 %i, i32 64, i32 0
  ret i32 %r
}
% llc -o - sh2.ll -mtriple=x86_64 
src: 
	movl	%edi, %ecx
	movl	$2, %eax
	shll	%cl, %eax
	andl	$64, %eax

tgt:  
	xorl	%eax, %eax
	cmpl	$5, %edi
	sete	%al
	shll	$6, %eax
% llc -o - sh2.ll -mtriple=aarch64
src:                    
	mov	w8, #2
	lsl	w8, w8, w0
	and	w0, w8, #0x40

tgt:        
	cmp	w0, #5
	cset	w8, eq
	lsl	w0, w8, #6

Actually, even if the instruction counts are the same, we would still prefer shift+and, because most modern CPUs have fewer ports for cmp than for shift, I think.

Do you have an example for a target where the code looks worse with the cmp+select IR?

And I tried to create 2 cases: https://alive2.llvm.org/ce/z/exvD7T

I agree that there is generally very little difference between them.
I also agree that we shouldn't weigh CPU microarchitecture too heavily for an IR transform.
Maybe we can add MD_range metadata for X to keep the information, so the backend has enough information to invert the transform.

Actually, even if the instruction counts are the same, we would still prefer shift+and, because most modern CPUs have fewer ports for cmp than for shift, I think.

Yes, some targets may have better throughput, but I don't think there is enough difference to change the decision in IR.

As a counter-example, consider the horrible result for the shift+and pattern with x86 vectors (a target that may not have good support for variable shifts):
https://alive2.llvm.org/ce/z/ciMo-S

So we just have to decide if this canonicalization is worthwhile to improve IR. The backend has to do some work either way to produce optimal code for all targets.

And I tried to create 2 cases: https://alive2.llvm.org/ce/z/exvD7T

Thanks - I understand your point. Although for RISCV64, I don't think we should see the sign extensions in a normal example from source:
https://godbolt.org/z/KWEPPTsx1

Maybe we can add MD_range metadata for X to keep the information, so the backend has enough information to invert the transform.

I haven't looked at metadata in a long time, so I don't know if it is practical. Range metadata is limited on where it can be added (for example, not on function parameters?), and then we have to make sure it is still available for the backend to use.
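
For what it's worth, a minimal sketch of how attaching the range could look (tagShiftAmountRange is a made-up helper; note !range can only annotate the integer result of a load, call, or invoke, which is the limitation above):

#include "llvm/ADT/APInt.h"
#include "llvm/IR/Instruction.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/MDBuilder.h"
using namespace llvm;

// Attach !range [0, BitWidth) to X so a later pass can prove no over-shift.
void tagShiftAmountRange(Instruction *X, unsigned BitWidth) {
  MDBuilder MDB(X->getContext());
  X->setMetadata(LLVMContext::MD_range,
                 MDB.createRange(APInt(BitWidth, 0), APInt(BitWidth, BitWidth)));
}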

Since we made similar transforms before this one, this seemed like a good canonicalization to me, but if you think there's too much risk for the backend, we can abandon. If we do that, then we'd likely want to recreate this transform in SDAG, so we can still improve examples like the x86 code I showed in the previous link.

Thanks for the x86 counterexample; I hadn't thought about that.
My point is that the backend can also do this transform, but it is hard to invert it if we canonicalize shift+and to icmp+select in IR.
I think we need someone else's help to review the discussion: @RKSimon @nikic @craig.topper.

nikic accepted this revision. Jun 16 2022, 12:55 AM

LG from my side. I don't think the differences in backend codegen are bad enough that we would even bother with a reverse transform at this point. We can reevaluate matching a more complex pattern if it becomes an issue.

llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp:1917

op0 -> Op0

This revision is now accepted and ready to land. Jun 16 2022, 12:55 AM

Thanks - let's give this a try.
It seems like it can help IR analysis, and there are potential wins for codegen too. If we find perf regressions that are not easy to reverse, then we can revert.

This revision was landed with ongoing or failed builds. Jun 17 2022, 7:52 AM
This revision was automatically updated to reflect the committed changes.