This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] matchFunnelShift - fold or(shl(a,x),lshr(b,sub(bw,x))) -> fshl(a,b,x) iff x < bw
ClosedPublic

Authored by RKSimon on Oct 3 2020, 9:24 AM.

Details

Summary

If value tracking can confirm that the shift amount is less than the type bitwidth, then we can more confidently fold the general or(shl(a,x),lshr(b,sub(bw,x))) pattern to a funnel shift/rotate intrinsic.
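
For illustration (not from the original summary), a minimal i32 instance of the pattern, where an explicit mask proves the shift amount is below the bitwidth - the hypothetical @fshl_pattern here mirrors the fake_fshl test quoted later in the thread:

define i32 @fshl_pattern(i32 %x, i32 %y, i32 %a) {
  %mask = and i32 %a, 31           ; proves shift amount < 32
  %shl = shl i32 %x, %mask
  %sub = sub nuw nsw i32 32, %mask
  %shr = lshr i32 %y, %sub
  %r = or i32 %shl, %shr           ; folds to fshl(%x, %y, %mask)
  ret i32 %r
}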

Diff Detail

Event Timeline

RKSimon created this revision. Oct 3 2020, 9:24 AM
Herald added a project: Restricted Project. · View Herald Transcript · Oct 3 2020, 9:24 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
RKSimon requested review of this revision. Oct 3 2020, 9:24 AM
RKSimon updated this revision to Diff 296641. Oct 7 2020, 4:37 AM

rebase - pass InstCombinerImpl into matchRotate

RKSimon updated this revision to Diff 296911. Oct 8 2020, 3:19 AM
RKSimon retitled this revision from [InstCombine] matchRotate - fold or(shl(v,x),lshr(v,bw-x)) -> fshl(v,v,x) iff x < bw to [InstCombine] matchFunnelShift - fold or(shl(a,x),lshr(b,sub(bw,x))) -> fshl(a,b,x) iff x < bw.

Add generic funnel support now that D88834 has landed

lebedev.ri requested changes to this revision. Oct 8 2020, 4:00 AM
lebedev.ri added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2097

// (shl ShVal, L) | (lshr ShVal, (Width - L)) iff L < Width

2099–2100

Either this needs a motivational comment, or this can be dropped.

----------------------------------------
define i64 @src(i64 %x, i64 %y, i64 %a) {
%0:
  %mask = and i64 %a, -1
  %shl = shl i64 %x, %mask
  %sub = sub nsw nuw i64 64, %mask
  %shr = lshr i64 %y, %sub
  %r = or i64 %shl, %shr
  ret i64 %r
}
=>
define i64 @tgt(i64 %x, i64 %y, i64 %a) {
%0:
  %r = fshl i64 %x, i64 %y, i64 %a
  ret i64 %r
}
Transformation seems to be correct!
This revision now requires changes to proceed. Oct 8 2020, 4:00 AM
spatel added inline comments. Oct 8 2020, 5:40 AM
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2099–2100

We can correctly convert to the intrinsic because the intrinsic makes guarantees about the shift amount (modulo bitwidth) that may not be present in the original IR.

But without proper masking/select for UB shift amounts, the expanded code will be worse in codegen. So this is an irreversible transform, and we can't do it universally.

So even with the computeKnownBits restriction in this patch, this transform may not be allowed. For example, can we fix this in codegen:
https://alive2.llvm.org/ce/z/IsuRoN

$ cat fsh.ll 
define i32 @fake_fshl(i32 %x, i32 %y, i32 %a) {
  %mask = and i32 %a, 31
  %shl = shl i32 %x, %mask
  %sub = sub nuw nsw i32 32, %mask
  %shr = lshr i32 %y, %sub
  %r = or i32 %shl, %shr
  ret i32 %r
}

define i32 @fshl(i32 %x, i32 %y, i32 %a) {
  %r = call i32 @llvm.fshl.i32(i32 %x, i32 %y, i32 %a)
  ret i32 %r
}

declare i32 @llvm.fshl.i32(i32, i32, i32)

$ llc -o - fsh.ll -mtriple=riscv64
fake_fshl:                              # @fake_fshl
	sllw	a0, a0, a2
	addi	a3, zero, 32
	sub	a2, a3, a2
	srlw	a1, a1, a2
	or	a0, a0, a1
	ret
fshl:                                   # @fshl
	andi	a2, a2, 31
	sll	a0, a0, a2
	not	a2, a2
	slli	a1, a1, 32
	srli	a1, a1, 1
	srl	a1, a1, a2
	or	a0, a0, a1
	ret
lebedev.ri added inline comments. Oct 8 2020, 6:02 AM
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2099–2100

We can correctly convert to the intrinsic because the intrinsic makes guarantees about the shift amount (modulo bitwidth) that may not be present in the original IR.

Yep, the intrinsic has implicit masking, as we can see from the explicit mask being dropped after the transform.

But without proper masking/select for UB shift amounts, the expanded code will be worse in codegen. So this is an irreversible transform, and we can't do it universally.

Yep, that is why I'm asking for more of a blurb.
The problematic case I have in mind: what if we were able to prove that the value is limited because we had range info on a load?
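
For example, a hypothetical sketch of such a case - the bound comes from !range metadata rather than an explicit and, so knownbits can prove the shift amount is in range even though expansion would have to introduce a mask that was never in the IR:

define i64 @rotl_range(i64 %x, i64* %p) {
  %a = load i64, i64* %p, !range !0   ; known to be in [0, 64)
  %shl = shl i64 %x, %a
  %sub = sub nuw nsw i64 64, %a
  %shr = lshr i64 %x, %sub
  %r = or i64 %shl, %shr
  ret i64 %r
}
!0 = !{i64 0, i64 64}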

RKSimon added inline comments. Oct 8 2020, 6:39 AM
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2099–2100

At least part of the riscv64 issue is due to it deciding to promote the fshl to i64 but not the or+shifts pattern (IIRC the sllw/srlw variants are i32 shifts with mod-32 shift amounts that sign-extend their results to i64).

SelectionDAG has 15 nodes:
  t0: ch = EntryToken
      t2: i64,ch = CopyFromReg t0, Register:i64 %0
        t4: i64,ch = CopyFromReg t0, Register:i64 %1
      t16: i64 = shl t4, Constant:i64<32>
        t6: i64,ch = CopyFromReg t0, Register:i64 %2
      t21: i64 = and t6, Constant:i64<31>
    t18: i64 = fshl t2, t16, t21
  t13: ch,glue = CopyToReg t0, Register:i64 $x10, t18
  t14: ch = RISCVISD::RET_FLAG t13, Register:i64 $x10, t13:1
RKSimon added inline comments. Oct 8 2020, 7:00 AM
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2099–2100

We might be able to improve DAGTypeLegalizer::PromoteIntRes_FunnelShift, in the case where the promoted type is at least twice the width of the original type, to emit just a shift(or(shl(x,bw),y),urem(z,bw)) pattern, unless the promoted fshl/fshr is legal.
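
A rough sketch of that idea in IR terms (illustrative only; the hypothetical @fshl_via_i64 assumes a power-of-two width, so the urem becomes an and):

define i32 @fshl_via_i64(i32 %x, i32 %y, i32 %z) {
  %xw = zext i32 %x to i64
  %yw = zext i32 %y to i64
  %hi = shl i64 %xw, 32
  %wide = or i64 %hi, %yw              ; concat(x, y)
  %zm = and i32 %z, 31                 ; shift amount mod 32
  %zw = zext i32 %zm to i64
  %sh = shl i64 %wide, %zw
  %top = lshr i64 %sh, 32              ; high half holds fshl(x, y, z)
  %r = trunc i64 %top to i32
  ret i32 %r
}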

Just so I don't unnecessarily waste time on this - are your main/blocking concerns that backend legalization/lowering is still poor or is it more fundamental than that?

Just so I don't unnecessarily waste time on this - are your main/blocking concerns that backend legalization/lowering is still poor or is it more fundamental than that?

I'm mainly interested in a better spelled-out motivation for the knownbits restriction.

spatel added a comment. Oct 9 2020, 8:48 AM

Just so I don't unnecessarily waste time on this - are your main/blocking concerns that backend legalization/lowering is still poor or is it more fundamental than that?

I'm mainly interested in a better spelled-out motivation for the knownbits restriction.

Agree - the default backend expansion seems fine for most targets. If the riscv case is an outlier that can be fixed, then it's probably ok to proceed here (but we should file a bug to raise awareness).

RKSimon updated this revision to Diff 297257. Oct 9 2020, 9:21 AM
RKSimon edited the summary of this revision. (Show Details)

Improve comment describing the limits of the fold.

Improve comment describing the limits of the fold.

So it is indeed trying to prevent degradation of code quality in the case of expansion
(maybe funnel shift intrinsics should simply track whether or not they were originally UB-safe?)

And that highlights my point - we might know that the shift amount is safe not from an explicit mask,
but e.g. from range metadata on a load,
which means expansion will still introduce a masking op that wasn't there originally.

How bad performance-wise would that be?

And that highlights my point - we might know that the shift amount is safe not from an explicit mask, but e.g. from range metadata on a load, which means expansion will still introduce a masking op that wasn't there originally.

So ValueTracking/SelectionDAG::ComputeKnownBits won't be able to check range data for us?

lebedev.ri accepted this revision. Oct 9 2020, 9:57 AM

Aha! I was confused for a moment about whether that info was one of those that survives into the back-end, and it does:
https://godbolt.org/z/6MWecW

SelectionDAG has 25 nodes:
  t0: ch = EntryToken
      t2: i32,ch = CopyFromReg t0, Register:i32 %0
      t4: i32,ch = CopyFromReg t0, Register:i32 %1
    t12: i64 = build_pair t2, t4
      t6: i32,ch = CopyFromReg t0, Register:i32 %2
      t8: i32,ch = CopyFromReg t0, Register:i32 %3
    t13: i64 = build_pair t6, t8
      t11: i32,ch = load<(load 4 from %fixed-stack.0)> t0, FrameIndex:i32<-1>, undef:i32
    t15: i64,ch = load<(load 8 from %ir.aptr, !range !0)> t0, t11, undef:i32
  t16: i64 = fshl t12, t13, t15
    t19: i32 = extract_element t16, Constant:i32<0>
  t21: ch,glue = CopyToReg t0, Register:i32 $r0, t19
    t18: i32 = extract_element t16, Constant:i32<1>
  t23: ch,glue = CopyToReg t21, Register:i32 $r1, t18, t21:1
  t24: ch = ARMISD::RET_FLAG t23, Register:i32 $r0, Register:i32 $r1, t23:1

So this seems fine to me.

This revision is now accepted and ready to land. Oct 9 2020, 9:57 AM
nlopes added a subscriber: nlopes. Oct 11 2020, 3:30 PM

Alive2 complains about one of the test cases:

define i64 @fshr_sub_mask(i64 %x, i64 %y, i64 %a) {
  %mask = and i64 %a, 63
  %shr = lshr i64 %x, %mask
  %sub = sub nsw nuw i64 64, %mask
  %shl = shl i64 %y, %sub
  %r = or i64 %shl, %shr
  ret i64 %r
}
=>
define i64 @fshr_sub_mask(i64 %x, i64 %y, i64 %a) {
  %r = fshr i64 %x, i64 %y, i64 %a
  ret i64 %r
}
Transformation doesn't verify!
ERROR: Value mismatch

Example:
i64 %x = #x0000100000000000 (17592186044416)
i64 %y = #x0000000000000000 (0)
i64 %a = #x0000000000000027 (39)

Source:
i64 %mask = #x0000000000000027 (39)
i64 %shr = #x0000000000000020 (32)
i64 %sub = #x0000000000000019 (25)
i64 %shl = #x0000000000000000 (0)
i64 %r = #x0000000000000020 (32)

Target:
i64 %r = #x0000000000000000 (0)
Source value: #x0000000000000020 (32)
Target value: #x0000000000000000 (0)

https://web.ist.utl.pt/nuno.lopes/alive2/index.php?hash=65e89ee9edd0ccba&test=Transforms%2FInstCombine%2Ffunnel.ll

This broke armv7 (tested with both armv7-w64-mingw32 and armv7-linux-gnueabihf) builds of compiler-rt's __udivmoddi4 function; after this commit, it miscalculates __udivmoddi4(883547321287490176, 128).

Alive2 complains about one of the test cases:

Thanks @nlopes - it looks like the args aren't getting swapped for some reason; I'll investigate.
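
(For reference, a sketch of the fold the failing test should have produced - the same target as in the report above, but with the first two fshr operands swapped:)

define i64 @fshr_sub_mask(i64 %x, i64 %y, i64 %a) {
  %r = fshr i64 %y, i64 %x, i64 %a   ; %y/%x order fixed vs. the miscompile
  ret i64 %r
}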

@nlopes @mstorsjo Thanks for the regression reports - I've now recommitted the patch, so it should pass.

Thanks! Now it does indeed seem to work, at least so far.