If value tracking can confirm that a shift amount is less than the type bitwidth, then we can more confidently fold general or(shl(a,x),lshr(b,sub(bw,x))) patterns to a funnel shift / rotate intrinsic pattern.
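To make the intent concrete, here is a minimal illustrative example (not taken from the patch's tests; the function name is made up): the shift amount is masked with 15 rather than the canonical bw-1, so only value tracking can prove it is below 64, and the or+shifts pattern should still become @llvm.fshl.i64.

define i64 @fshl_narrow_mask(i64 %x, i64 %y, i64 %a) {
  ; known bits: %mask is in [0, 15], so it is provably < 64
  %mask = and i64 %a, 15
  %shl = shl i64 %x, %mask
  %sub = sub nuw nsw i64 64, %mask
  %shr = lshr i64 %y, %sub
  %r = or i64 %shl, %shr
  ret i64 %r
}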
Event Timeline
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2097: // (shl ShVal, L) | (lshr ShVal, (Width - L)) iff L < Width

2099–2100: Either this needs a motivational comment, or this can be dropped.

----------------------------------------
define i64 @src(i64 %x, i64 %y, i64 %a) {
%0:
  %mask = and i64 %a, -1
  %shl = shl i64 %x, %mask
  %sub = sub nsw nuw i64 64, %mask
  %shr = lshr i64 %y, %sub
  %r = or i64 %shl, %shr
  ret i64 %r
}
=>
define i64 @tgt(i64 %x, i64 %y, i64 %a) {
%0:
  %r = fshl i64 %x, i64 %y, i64 %a
  ret i64 %r
}
Transformation seems to be correct!
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2099–2100: We can correctly convert to the intrinsic because the intrinsic makes guarantees about the shift amount (modulo bitwidth) that may not be present in the original IR. But without proper masking/select for UB shift amounts, the expanded code will be worse in codegen. So this is an irreversible transform, and we can't do it universally. So even with the computeKnownBits restriction in this patch, this transform may not be allowed.

For example, can we fix this in codegen:

$ cat fsh.ll
define i32 @fake_fshl(i32 %x, i32 %y, i32 %a) {
  %mask = and i32 %a, 31
  %shl = shl i32 %x, %mask
  %sub = sub nuw nsw i32 32, %mask
  %shr = lshr i32 %y, %sub
  %r = or i32 %shl, %shr
  ret i32 %r
}

define i32 @fshl(i32 %x, i32 %y, i32 %a) {
  %r = call i32 @llvm.fshl.i32(i32 %x, i32 %y, i32 %a)
  ret i32 %r
}
declare i32 @llvm.fshl.i32(i32, i32, i32)

$ llc -o - fsh.ll -mtriple=riscv64
fake_fshl:                              # @fake_fshl
  sllw a0, a0, a2
  addi a3, zero, 32
  sub a2, a3, a2
  srlw a1, a1, a2
  or a0, a0, a1
  ret
fshl:                                   # @fshl
  andi a2, a2, 31
  sll a0, a0, a2
  not a2, a2
  slli a1, a1, 32
  srli a1, a1, 1
  srl a1, a1, a2
  or a0, a0, a1
  ret
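For context, a target without a native funnel shift has to expand a variable-amount fshl in a way that is safe for every amount, since the intrinsic masks the amount modulo the bit width. A rough IR-level sketch of such an expansion, mirroring the riscv64 fshl output above (illustrative only, not the exact legalization code; the function name is made up):

define i32 @fshl_expanded(i32 %x, i32 %y, i32 %z) {
  %s = and i32 %z, 31       ; amount modulo 32
  %hi = shl i32 %x, %s
  %notz = xor i32 %z, -1    ; ~z
  %inv = and i32 %notz, 31  ; equals 31 - (%z & 31)
  %y1 = lshr i32 %y, 1      ; pre-shift so the next shift is always < 32
  %lo = lshr i32 %y1, %inv
  %r = or i32 %hi, %lo
  ret i32 %r
}

The extra mask/not/pre-shift operations are exactly what the original or+shifts pattern did not have, which is why the transform is hard to undo once the shift-amount information has been folded into the intrinsic.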
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2099–2100:
Yep, the intrinsic has an implicit masking, as we can see by said masking being dropped after the transform.

Yep, that is why I'm asking for more blurb.
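An illustrative before/after of that implicit masking (made-up function name, roughly what the existing fold does for the canonically masked pattern): the explicit and disappears because @llvm.fshl.i64 already takes its amount modulo 64.

define i64 @rot_mask(i64 %x, i64 %y, i64 %a) {
  %mask = and i64 %a, 63
  %shl = shl i64 %x, %mask
  %sub = sub nuw nsw i64 64, %mask
  %shr = lshr i64 %y, %sub
  %r = or i64 %shl, %shr
  ret i64 %r
}
=>
define i64 @rot_mask(i64 %x, i64 %y, i64 %a) {
  ; the 'and %a, 63' is dropped: fshl masks the amount modulo 64 itself
  %r = call i64 @llvm.fshl.i64(i64 %x, i64 %y, i64 %a)
  ret i64 %r
}
declare i64 @llvm.fshl.i64(i64, i64, i64)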
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2099–2100: At least part of the riscv64 issue is due to it deciding to promote the fshl to i64 but not the or+shifts pattern (iirc the sllw/srlw variants are i32 shifts with mod32 amounts that sext the results to i64).

SelectionDAG has 15 nodes:
  t0: ch = EntryToken
  t2: i64,ch = CopyFromReg t0, Register:i64 %0
  t4: i64,ch = CopyFromReg t0, Register:i64 %1
  t16: i64 = shl t4, Constant:i64<32>
  t6: i64,ch = CopyFromReg t0, Register:i64 %2
  t21: i64 = and t6, Constant:i64<31>
  t18: i64 = fshl t2, t16, t21
  t13: ch,glue = CopyToReg t0, Register:i64 $x10, t18
  t14: ch = RISCVISD::RET_FLAG t13, Register:i64 $x10, t13:1
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
2099–2100: We might be able to improve DAGTypeLegalizer::PromoteIntRes_FunnelShift, in the case where the promoted type is at least 2x the original type, to emit just a shift(or(shl(x,bw),y),urem(z,bw)) pattern unless the promoted fshl/fshr is legal.
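For reference, a hedged IR-level sketch of that idea (illustrative name; assuming an i32 fshl promoted to i64): concatenate the two halves into the wide type, shift by the amount modulo the original width, and take the high half back out, avoiding the not/pre-shift sequence of the generic expansion.

define i32 @fshl_via_i64(i32 %x, i32 %y, i32 %z) {
  %xw = zext i32 %x to i64
  %yw = zext i32 %y to i64
  %xh = shl i64 %xw, 32
  %concat = or i64 %xh, %yw       ; x:y as a single i64
  %zm = and i32 %z, 31            ; z mod 32
  %zw = zext i32 %zm to i64
  %shifted = shl i64 %concat, %zw
  %hi = lshr i64 %shifted, 32     ; high 32 bits hold fshl(x, y, z)
  %r = trunc i64 %hi to i32
  ret i32 %r
}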
Just so I don't unnecessarily waste time on this - are your main/blocking concerns that backend legalization/lowering is still poor or is it more fundamental than that?
I'm mainly interested in a better spelled-out motivation for the knownbits restriction.
Agree - the default backend expansion seems fine for most targets. If the riscv case is an outlier that can be fixed, then it's probably ok to proceed here (but we should file a bug to raise awareness).
So it is indeed trying to prevent degradation of code quality in case of expansion.
(Maybe funnel shift intrinsics should simply track whether or not they were originally UB-safe?)
And that highlights my point: we might know that the shift amount is safe not from an explicit mask,
but e.g. from range metadata on a load,
which means expansion will still introduce a masking op that wasn't there originally.
How bad, performance-wise, would that be?
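A concrete, made-up example of that situation: the only thing bounding the shift amount is !range metadata on the load, so there is no and instruction for a later expansion to fold away. This is essentially the shape behind the godbolt example below.

define i64 @fshl_range_amt(i64 %x, i64 %y, i64* %aptr) {
  %a = load i64, i64* %aptr, !range !0   ; known to be in [0, 64) only via metadata
  %shl = shl i64 %x, %a
  %sub = sub nuw nsw i64 64, %a
  %shr = lshr i64 %y, %sub
  %r = or i64 %shl, %shr
  ret i64 %r
}
!0 = !{i64 0, i64 64}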
So ValueTracking/SelectionDAG::ComputeKnownBits won't be able to check range data for us?
Aha! I was confused for a moment about whether that info was one of those things that survives into the back-end, and it does:
https://godbolt.org/z/6MWecW
SelectionDAG has 25 nodes:
  t0: ch = EntryToken
  t2: i32,ch = CopyFromReg t0, Register:i32 %0
  t4: i32,ch = CopyFromReg t0, Register:i32 %1
  t12: i64 = build_pair t2, t4
  t6: i32,ch = CopyFromReg t0, Register:i32 %2
  t8: i32,ch = CopyFromReg t0, Register:i32 %3
  t13: i64 = build_pair t6, t8
  t11: i32,ch = load<(load 4 from %fixed-stack.0)> t0, FrameIndex:i32<-1>, undef:i32
  t15: i64,ch = load<(load 8 from %ir.aptr, !range !0)> t0, t11, undef:i32
  t16: i64 = fshl t12, t13, t15
  t19: i32 = extract_element t16, Constant:i32<0>
  t21: ch,glue = CopyToReg t0, Register:i32 $r0, t19
  t18: i32 = extract_element t16, Constant:i32<1>
  t23: ch,glue = CopyToReg t21, Register:i32 $r1, t18, t21:1
  t24: ch = ARMISD::RET_FLAG t23, Register:i32 $r0, Register:i32 $r1, t23:1
So this seems fine to me.
Alive2 complains about one of the test cases:
define i64 @fshr_sub_mask(i64 %x, i64 %y, i64 %a) {
  %mask = and i64 %a, 63
  %shr = lshr i64 %x, %mask
  %sub = sub nsw nuw i64 64, %mask
  %shl = shl i64 %y, %sub
  %r = or i64 %shl, %shr
  ret i64 %r
}
=>
define i64 @fshr_sub_mask(i64 %x, i64 %y, i64 %a) {
  %r = fshr i64 %x, i64 %y, i64 %a
  ret i64 %r
}
Transformation doesn't verify!
ERROR: Value mismatch

Example:
i64 %x = #x0000100000000000 (17592186044416)
i64 %y = #x0000000000000000 (0)
i64 %a = #x0000000000000027 (39)

Source:
i64 %mask = #x0000000000000027 (39)
i64 %shr = #x0000000000000020 (32)
i64 %sub = #x0000000000000019 (25)
i64 %shl = #x0000000000000000 (0)
i64 %r = #x0000000000000020 (32)

Target:
i64 %r = #x0000000000000000 (0)
Source value: #x0000000000000020 (32)
Target value: #x0000000000000000 (0)
This broke armv7 (tested with both armv7-w64-mingw32 and armv7-linux-gnueabihf) builds of compiler-rt's __udivmoddi4 function; after this commit, it miscalculates __udivmoddi4(883547321287490176, 128).
Thanks @nlopes, it looks like the args aren't getting swapped for some reason - I'll investigate.
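For what it's worth, plugging the counterexample into the funnel-shift semantics suggests the operands just need to swap for this orientation of the pattern: %x feeds the lshr (the low/right side) and %y feeds the shl, so the expected target would presumably be something like:

define i64 @fshr_sub_mask(i64 %x, i64 %y, i64 %a) {
  ; note the swapped operands relative to the failing target above
  %r = call i64 @llvm.fshr.i64(i64 %y, i64 %x, i64 %a)
  ret i64 %r
}
declare i64 @llvm.fshr.i64(i64, i64, i64)

With that order, the counterexample evaluates to (x >> 39) | (y << 25) = 32 in both source and target.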