Many targets have optimizations for encoding constants with a repeated bit pattern. Sign extension is one of the more trivial/common ones, so avoid breaking it.
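As a concrete illustration (a standalone sketch, not code from the patch; the constants mirror the x86-64 example discussed later in this review):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  // -16 sign-extends from 8 bits, so x86-64 can encode "and rax, -16"
  // with a one-byte immediate.
  int64_t Mask = -16;                      // 0xfffffffffffffff0
  // Suppose analysis proves the top two bits of the other operand are
  // zero (as in the llvm.assume example below).
  uint64_t KnownZero = 0xc000000000000000; // bits 62-63
  // Clearing those known-zero bits from the mask is semantically fine...
  int64_t Shrunk = Mask & ~(int64_t)KnownZero; // 0x3ffffffffffffff0
  // ...but the result no longer sign-extends from a small width, so the
  // target now needs a full 64-bit immediate (movabs on x86-64).
  printf("%016llx -> %016llx\n", (unsigned long long)Mask,
         (unsigned long long)Shrunk);
}
```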
Event Timeline
My first question here would be whether we should be undoing this in the backend instead. We perform demanded bits analysis on SDAG, and there is an existing targetShrinkDemandedConstant() hook. The name suggests that it should shrink the constant, but I see no reason why it couldn't do the reverse if that is profitable for the target. From a quick look, it seems like RISCV already does something like that.
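For reference, here is a hedged, untested sketch of how a target could use that hook to *widen* an AND mask back to a sign-extended immediate rather than shrink it. The class name MyTargetLowering and the 12-bit immediate width are illustrative, loosely modeled on what RISCV does:

```cpp
#include "llvm/CodeGen/TargetLowering.h"
using namespace llvm;

// Sketch only: widen an AND mask so its high bits become a sign extension
// of the demanded part, if the widened value fits the target's cheap
// sign-extended immediate form (12 bits here, as on RISCV).
bool MyTargetLowering::targetShrinkDemandedConstant(
    SDValue Op, const APInt &DemandedBits, const APInt &DemandedElts,
    TargetLoweringOpt &TLO) const {
  if (Op.getOpcode() != ISD::AND)
    return false;
  auto *C = dyn_cast<ConstantSDNode>(Op.getOperand(1));
  if (!C)
    return false;
  const APInt &Mask = C->getAPIntValue();
  // For an AND, setting mask bits that are not demanded cannot change any
  // demanded result bit, so this rewrite is always sound.
  APInt Expanded = Mask | ~DemandedBits;
  if (Expanded == Mask || !Expanded.isSignedIntN(12))
    return false;
  SDLoc DL(Op);
  EVT VT = Op.getValueType();
  SDValue NewC = TLO.DAG.getConstant(Expanded, DL, VT);
  SDValue NewAnd = TLO.DAG.getNode(ISD::AND, DL, VT, Op.getOperand(0), NewC);
  return TLO.CombineTo(Op, NewAnd);
}
```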
Of course, doing this in InstCombine may also make sense -- the test diffs look like this might be slightly beneficial at the IR layer for the pattern emitted by LoopVectorize (e.g. see diffs in LoopVectorize/X86/small-size.ll). If we do want this in InstCombine, I'd expect a more principled approach that does not hardcode specific sizes. E.g. we could generally avoid masking off any sign bits, to avoid increasing the number of significant (signed) bits.
+1 to investigating whether this can be handled in the DAG before adding something so specific to InstCombine. If value tracking can confirm that the upper bits are already zero, there's no reason why the various SimplifyDemanded* calls shouldn't determine whether a sext-imm is a better option for a specific target.
The diffs are similar to what I was going for in:
d6498abc241b
...and I'm not aware of any regressions from that change.
So this seems fine to try in IR even before looking at codegen (and then maybe we won't need to do anything in codegen).
But I agree it should not be hard-coded based on i32: avoid shrinking that increases the number of significant bits in general, possibly in combination with InstCombinerImpl::isDesirableIntType().
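Something like the following guard would capture that rule (a sketch using APInt; note that getSignificantBits() is called getMinSignedBits() in older releases):

```cpp
#include "llvm/ADT/APInt.h"
using namespace llvm;

// Clear known-zero bits from C only if that does not increase the number
// of significant (signed) bits, i.e. never turn a small-magnitude
// negative constant into a large-magnitude one.
static bool okToShrink(const APInt &C, const APInt &KnownZero) {
  APInt Shrunk = C & ~KnownZero;
  return Shrunk.getSignificantBits() <= C.getSignificantBits();
}
```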
So I took a look at X86 which already has X86DAGToDAGISel::shrinkAndImmediate. Tested with:
```c
long td(long a) {
  if (a & (3L << 62)) { __builtin_unreachable(); }
  return a & (-16L);
}
```
Which gets:
```llvm
%tobool.not = icmp ult i64 %a, 4611686018427387904
tail call void @llvm.assume(i1 %tobool.not)
%and1 = and i64 %a, 4611686018427387888
ret i64 %and1
```
For some reason, however, the known-bits information is lost by the time X86DAGToDAGISel::shrinkAndImmediate is called. In SelectionDAG::MaskedValueIsZero earlier in the pipeline we get Known Zeros/Ones: c00000000000000f / 0, but in X86ISelDAGToDAG.cpp it's Zeros/Ones: 0 / 0. Likewise, I checked around TargetLowering::SimplifyDemandedBits in the ISD::AND case, and by then LHSKnown doesn't have the needed information either: Zeros/Ones: 0 / 0.
My general (absolutely non-expert) opinion is that handling this in InstCombine with TTI.getIntImmCostInst makes the most sense. It avoids having to re-implement this logic in every target backend, and InstCombine runs before the information loss that makes undoing the shrink difficult.
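If TTI were allowed here, the query being discussed would look something like this (a sketch only; the helper name and cost kind are my choice, not part of the patch):

```cpp
#include "llvm/Analysis/TargetTransformInfo.h"
using namespace llvm;

// Keep the shrunk immediate only if the target considers it no more
// expensive to materialize in this instruction than the original.
static bool shrunkImmIsCheaper(const TargetTransformInfo &TTI,
                               unsigned Opcode, unsigned OpIdx,
                               const APInt &OldImm, const APInt &NewImm,
                               Type *Ty) {
  auto Kind = TargetTransformInfo::TCK_SizeAndLatency;
  return TTI.getIntImmCostInst(Opcode, OpIdx, NewImm, Ty, Kind) <=
         TTI.getIntImmCostInst(Opcode, OpIdx, OldImm, Ty, Kind);
}
```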
Edit:
If I change the code to:
```c
long foo();
long td(long a) {
  if (a & (3L << 62)) { return foo(); }
  return a & (-16L);
}
```
It works. What I expect is happening is that by the time the code reaches the DAG, the llvm.assume has been removed. Unless we want to propagate llvm.assume to the backends, there seems to be a genuine reason to keep this in InstCombine.
See my other comment: when I tested this in the X86 backend, if the known bits are determined via llvm.assume, that information is lost by the time it reaches the backend.
> Of course, doing this in InstCombine may also make sense -- the test diffs look like this might be slightly beneficial at the IR layer for the pattern emitted by LoopVectorize (e.g. see diffs in LoopVectorize/X86/small-size.ll). If we do want this in InstCombine, I'd expect a more principled approach that does not hardcode specific sizes. E.g. we could generally avoid masking off any sign bits, to avoid increasing the number of significant (signed) bits.
Can you expand on the rationale behind not using TTI.getIntImmCostInst here? As written this isn't a target-specific transform, and yet "is this imm better than that one?" seems like an inherently target-specific question.
Edit:
There already seems to be a target-specific exception for nearly this exact case in InstCombinerImpl::isDesirableIntType, which AFAICT queries isLegalInteger, which is target-dependent.
I don't think InstCombinerImpl::isDesirableIntType() is what we want. For example, on X86 64-bit integers are legal (so isDesirableIntType() will return true), so it would never help avoid the extra instruction incurred by movabs.
Yes, that's correct, llvm.assume is not preserved in SDAG. If the motivation here are assumes in particular, then we cannot undo in the backend.
InstCombine is a target-independent canonicalization pass. It produces IR in a canonical form that other passes (or other transforms in InstCombine for that matter) can rely on. This canonical IR is independent of the target, beyond an unavoidable dependence on the DataLayout.
We could loosen this restriction (this would require an RFC on discourse), but at this point I don't really see evidence that it would be necessary to handle this case.
If we avoid shrinking that increases significant bits, the regressions I see are in @scalar_i32_lshr_and_negC_eq_X_is_constant1 and @scalar_i32_lshr_and_negC_eq_X_is_constant2, so we'd want to think about whether these can be avoided. (This would be necessary regardless of restrictions on the transform, it's just essentially impossible to find these cases if target-dependence is involved.)
llvm.assume is lost when we go to the DAG - but don't we generate a suitable AssertZext node?
Make it so that we never shrink if it breaks sign extension.
Fix an optimization that was relying on there being no common bits between Known.Zero and the value.
Fair enough, will drop it.
> If we avoid shrinking that increases significant bits, the regressions I see are in @scalar_i32_lshr_and_negC_eq_X_is_constant1 and @scalar_i32_lshr_and_negC_eq_X_is_constant2, so we'd want to think about whether these can be avoided. (This would be necessary regardless of restrictions on the transform, it's just essentially impossible to find these cases if target-dependence is involved.)
Fixed those two cases. The issue was an assumption that Known.Zero & Value == 0, which let the code use ~Known.Zero + Value interchangeably with ~Known.Zero | Value (the two are equal only when the operands have no common bits). The original change would have broken on this had it been tested with i64 in addition to i32.
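For context, a minimal standalone demonstration of the broken identity (the specific constants are mine, chosen to mirror the i64 case):

```cpp
#include <cassert>
#include <cstdint>

int main() {
  // A + B == A | B holds only when A and B share no set bits.
  uint64_t NotKnownZero = 0x3ffffffffffffff0; // plays the role of ~Known.Zero
  uint64_t DisjointVal  = 0x000000000000000f; // no common bits: identity holds
  assert(NotKnownZero + DisjointVal == (NotKnownZero | DisjointVal));

  uint64_t OverlapVal = 0xfffffffffffffff0;   // shares bits: identity breaks
  assert(NotKnownZero + OverlapVal != (NotKnownZero | OverlapVal));
}
```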
There is one more case in llvm/test/Transforms/LoopVectorize/induction.ll where we regress from zext -> sext.
I took a bit of a look; the issue is that by unsetting bits, we are essentially propagating analysis information. In InstCombinerImpl::visitSExt, the check isKnownNonNegative(Src, DL, 0, &AC, &CI, &DT) never finds the prior comparison (which was available in InstCombineCompares), and since the and itself doesn't prohibit the sign bit, it evaluates to false.
My feeling is we have three choices:
- Allow removal of the sign bit if it transforms imm8 -> imm16/imm32 or imm16 -> imm32, as those are all the cases where we can get the sext -> zext optimization (see the sketch after this list).
- Allow arbitrary bit simplification in InstCombine and undo it in a later pass that runs while we still have llvm.assume (where maybe we can use TTI).
- If we add llvm.assume(%cond) at the head of all taken BBs and llvm.assume(!%cond) at the head of all non-taken BBs, this case is fixed (the analysis doesn't find the br, but it finds the assumes). In general this seems sensible, as there is a lot of duplicate logic for proving bits between the code that evaluates the CFG and the code that evaluates the assumes. No idea what this would do to compile time, though; my guess is it would absolutely explode.
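A minimal sketch of what the first option's guard could look like (bucket widths and helper names are mine, not from the patch):

```cpp
#include "llvm/ADT/APInt.h"
using namespace llvm;

// Smallest standard immediate width the constant sign-extends from.
static unsigned immBucket(const APInt &C) {
  unsigned Bits = C.getSignificantBits();
  return Bits <= 8 ? 8 : Bits <= 16 ? 16 : Bits <= 32 ? 32 : 64;
}

// Permit clearing the sign bit only for imm8 -> imm16/imm32 and
// imm16 -> imm32, the cases where a sext can become a zext without
// falling into an expensive 64-bit immediate form.
static bool signBitRemovalOK(const APInt &Old, const APInt &New) {
  return immBucket(Old) <= 16 && immBucket(New) <= 32;
}
```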
I'm fine with this change, but let's wait for @spatel. The zext->sext changes in induction.ll don't look particularly problematic to me.
LGTM - see inline for some code comment adjustments.
Inline comments:

llvm/lib/Transforms/InstCombine/InstCombineCompares.cpp
- Line 1725: Adding is a bug, but I can't find a way to expose it since we always call SimplifyDemandedBits before we reach here. It's possible this patch will expose other bugs like that, so we'll need to watch for fallout.

llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp
- Lines 47–53: For IR, the target effects are a secondary concern. We should just say something like:
- Line 57: undoinging -> undoing
- Line 59: propegated -> propagated
- Lines 64–65: I'd change this sentence to make the outcome we're looking for explicit:
  // For anything larger than an 8-bit (byte) constant, avoid changing a
  // small magnitude negative number into a larger magnitude number.
Forgot to mention - please edit the patch title/description to reflect the updated code (not i32/i64 specific any more).
llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp
- Line 68: I think we are missing an inverse case.
Hmm, I was generally in favor of trying to preserve this. Is there opposition to allowing removal of the sign bit if it can be used for sext -> zext (so imm8 -> imm16/imm32 and imm16 -> imm32)?
llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp
- Line 68: Good catch, will check that out in V3.
llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp
- Line 68: I implemented this but saw zero cases where it could apply.
I found another issue. I changed the code to just disable ShrinkDemandedConstant and noticed that quite a few more optimizations break. We are just getting "lucky" with this patch because the test coverage doesn't have constants that cross type size boundaries. From what I can tell, the following tests need to be fixed:
llvm/test/Transforms/InstCombine/demand_shrink_nsw.ll:
- foo

llvm/test/Transforms/InstCombine/icmp-and-shift.ll:
- icmp_eq_and_pow2_shl_pow2_negative1
- icmp_eq_and_pow2_shl_pow2_negative3
- icmp_eq_and_pow2_minus1_shl1
- icmp_eq_and_pow2_minus1_shl1_vec
- icmp_ne_and_pow2_minus1_shl1
- icmp_ne_and_pow2_minus1_shl1_vec
- icmp_eq_and_pow2_minus1_shl_pow2
- icmp_eq_and_pow2_minus1_shl_pow2_vec
- icmp_ne_and_pow2_minus1_shl_pow2

llvm/test/Transforms/InstCombine/icmp-mul-and.ll:
- mul_mask_pow2_sgt0
- mul_mask_pow2_eq4
- pr40493
- pr40493_neg1 (maybe)
- pr51551_neg1

llvm/test/Transforms/InstCombine/icmp.ll:
- test17vec
- test17a
- test20vec
- test20a

llvm/test/Transforms/InstCombine/select-ctlz-to-cttz.ll:
- PR45762
- PR45762_logical

llvm/test/Transforms/InstCombine/sub.ll:
- demand_low_bits_uses_commute
And then there are a variety of cases where we will lose cmp r, 0, which is often better than cmp r, C. A lot of analyses/transforms seem to use the actual constant values rather than the known-needed constant values. I think maybe adding a separate pass that runs right before llvm.assume is lost would be a better way to do this. Otherwise, I think we should fix all those cases first.