This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] change conditions for transform of sub to xor
Abandoned · Public

Authored by spatel on Jul 13 2022, 8:30 AM.

Details

Reviewers
bjope
foad
nikic
Summary

As discussed in D128123, converting sub to xor can cause codegen regressions, and it may not be reversible with respect to no-wrap as shown here:
https://alive2.llvm.org/ce/z/vebvhH

The PhaseOrdering test is based on the example provided in the other review.

Generally, we prefer bitwise-logic to math ops in IR even if that means dropping no-wrap flags -- because that allows better bit-tracking. But as noted, we need to improve the backend before easing the conditions here to avoid sub-optimal codegen.

This makes the IR fold the same as SDAG other than avoiding the transform in the presence of nsw/nuw. We probably want to update SDAG too -- either with a similar check as here or with a TLI hook to restrict the fold for targets that prefer sub-from-constant form.
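For illustration, a minimal sketch of the pattern in question (the function names and constants are made up for this example; the exact preconditions live in the patch itself): when the set bits of the constant cover every bit the other operand might have set, the subtraction cannot borrow and is bitwise-equivalent to an xor, but the rewrite drops the no-wrap flags. With this patch, the presence of those flags keeps the sub:

define i8 @keeps_sub(i8 %a) {
  %x = and i8 %a, 31            ; %x fits in the low 5 bits
  %s = sub nuw nsw i8 31, %x    ; flags present -> left as a sub by this patch
  ret i8 %s
}

define i8 @becomes_xor(i8 %a) {
  %x = and i8 %a, 31
  %s = sub i8 31, %x            ; no flags -> still folds to: xor i8 %x, 31
  ret i8 %s
}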

Diff Detail

Event Timeline

spatel created this revision. Jul 13 2022, 8:30 AM
Herald added a project: Restricted Project. · View Herald Transcript · Jul 13 2022, 8:30 AM
spatel requested review of this revision. Jul 13 2022, 8:30 AM
Herald added a project: Restricted Project. · View Herald Transcript · Jul 13 2022, 8:30 AM
nikic added a comment (edited). Jul 14 2022, 12:44 AM

As discussed in D128123, converting sub to xor can cause codegen regressions, and it may not be reversible with respect to no-wrap as shown here:
https://alive2.llvm.org/ce/z/CgOlZU

Your example seems to show that nsw can be recovered, only nuw can't be?

As discussed in D128123, converting sub to xor can cause codegen regressions, and it may not be reversible with respect to no-wrap as shown here:
https://alive2.llvm.org/ce/z/CgOlZU

Your example seems to show that nsw can be recovered, only nuw can't be?

Oops - I was altering the constant values to see when the transform is and isn't bidirectional, and I grabbed the wrong link. The code comments don't match the code in that link.
These are the non-bidirectional examples (the mask is negative, but the subtract constant is not):
https://alive2.llvm.org/ce/z/vebvhH

In all other cases, the transform seems to be bidirectional:
https://alive2.llvm.org/ce/z/ffGPvL

Also note that there is an existing regression test (not altered by this patch) that shows the potential harm if we do the transform and lose the no-wrap:

define i1 @test_negative_combined_sub_signed_overflow(i8 %x) {
; CHECK-LABEL: @test_negative_combined_sub_signed_overflow(
; CHECK-NEXT:    ret i1 false
;
  %y = sub nsw i8 127, %x
  %z = icmp slt i8 %y, -1
  ret i1 %z
}

If we convert sub to xor first, we can't then fold the compare to false.
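To make that concrete, here is a hand-written sketch of the xor form (not necessarily the literal output of the fold): once the nsw is gone, %x may have its sign bit set, so the compare can be true and no longer folds to false:

define i1 @xor_form_not_foldable(i8 %x) {
  %y = xor i8 %x, 127           ; e.g. %x = -1 gives %y = -128
  %z = icmp slt i8 %y, -1       ; can be true, so it does not fold to false
  ret i1 %z
}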

spatel edited the summary of this revision. Jul 14 2022, 4:40 AM
bjope added a comment. Jul 14 2022, 2:08 PM

Thanks! Verified that it solves the regressions that I mentioned in D128123.

For future follow-ups: Maybe the value/bit tracking implementations for sub can be taught that this kind of sub is an xor in disguise, so we get the wanted bit tracking even without transforming to xor? It is probably easier to deal with "known bits" of the non-constant operand while doing value/bit tracking than during SCEV analysis (for the reverse - i.e. understanding that an xor is a sub in disguise).

This patch LGTM. But since @nikic had comments earlier I think you need to wait for an accept from Nikita as well.

nikic added a comment. Jul 15 2022, 1:14 AM

At a high level, I find the notion of disabling folds to preserve nowrap flags somewhat concerning. Doing this means that an improvement in one place (which infers more nowrap flags) can disable a later optimization, because it now considers preserving those flags more important. We sometimes do this for correctness reasons (mainly in that an operation with nowrap flags cannot be guaranteed-not-poison), but do we have any precedent in InstCombine for doing this specifically to preserve the flags?

In this specific case, I think we should question the value of the entire transform, at least on the IR level. Is there some specific motivation for it (i.e. some chain of follow-on transforms it enables?) or are we just doing it because we can? Our KnownBits handling of sub is optimal, so for computeKnownBits() I believe it shouldn't matter whether the instruction is a sub or a xor (in cases where we convert one into the other based on known bits only). However, for ConstantRange for example we also have an optimal sub implementation, but xor is heavily approximated. Similarly SCEV does not understand xor, but understands sub. So I question the premise here that we want to prefer xor over sub as a canonical form. I believe on average, our support for sub is better (the only xor variant that has decently universal support is xor by minus one, aka not).

At a high level, I find the notion of disabling folds to preserve nowrap flags somewhat concerning. Doing this means that an improvement in one place (which infers more nowrap flags) can disable a later optimization, because it now considers preserving those flags more important. We sometimes do this for correctness reasons (mainly in that an operation with nowrap flags cannot be guaranteed-not-poison), but do we have any precedent in InstCombine for doing this specifically to preserve the flags?

I don't know of any other cases where we avoid a transform based on no-wrap. AFAIK, we have always leaned towards doing a transform to a simpler op even if it meant losing no-wrap.

In this specific case, I think we should question the value of the entire transform, at least on the IR level. Is there some specific motivation for it (i.e. some chain of follow-on transforms it enables?) or are we just doing it because we can? Our KnownBits handling of sub is optimal, so for computeKnownBits() I believe it shouldn't matter whether the instruction is a sub or a xor (in cases where we convert one into the other based on known bits only). However, for ConstantRange for example we also have an optimal sub implementation, but xor is heavily approximated. Similarly SCEV does not understand xor, but understands sub. So I question the premise here that we want to prefer xor over sub as a canonical form. I believe on average, our support for sub is better (the only xor variant that has decently universal support is xor by minus one, aka not).

We have been canonicalizing away from sub recently, and I assumed xor was better than sub for bit-tracking, so I didn't consider the opposite transform.

I tracked down the original fold to:
https://github.com/llvm/llvm-project/commit/010337c8385b919f
...so the motivation was really about codegen, but we've also generally assumed that bitwise ops are a better canonicalization than math.

The fold was moved to InstCombiner::visitSub() with:
D9417 / https://github.com/llvm/llvm-project/commit/ec6833420fa3c6f44c0e464ed2cbc6c483f37781
...and that brings us to the recent history:

  1. I noticed that we could use the transform in SDAG, so I proposed to copy the InstCombine: D128080
  2. Then, I saw that x86 had a stronger version of the fold, so I made that a generic SDAG fold: D128123
  3. There was feedback that we could relax the constraints a bit more, so: d0eec5f7e787
  4. I figured if we're doing this fold in both places, we should make it consistent: 79bb915fb60b

So it just accumulated to the current state.
This reminds me of turning 'add' into 'or' with no common bits set (no carries) - that led to a long series of work-arounds to match new patterns with 'or'.
I'm not sure what (if any) SCEV changes were made to accommodate that transform, but it makes sense to me if we want to avoid that with sub->xor.
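For reference, a minimal sketch of that add -> or canonicalization (constants chosen only for illustration): when the operands have no common set bits, the add cannot carry, so it is rewritten as a bitwise or, which is what the later pattern-matching work-arounds then have to recognize:

define i8 @add_becomes_or(i8 %a) {
  %lo = and i8 %a, 15           ; only the low nibble can be set
  %r = add i8 %lo, 32           ; no bits in common with 32
  ; InstCombine canonicalizes the add above to: or i8 %lo, 32
  ret i8 %r
}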

I think this patch is going to effectively prevent the transform in almost all cases because we seem to infer no-wrap on these patterns effectively. We could commit this, wait a bit to see if there's any fallout, then remove the transform completely.

That raises the question: do we want to canonicalize xor to sub? Maybe we just ignore that and live with both possibilities.

This change is causing issues with creating a SCEV expression for the subtract. This leads to not being able to delinearize some expressions, which can have an impact on loop passes like loop interchange. For example, with the attached IR ({F23852262}), we are no longer able to compute a simple SCEV expression for the subtract, which is now an xor:
Output from -passes=print<scalar-evolution>

With the subtract:

  %_val_lda_ = load i32, i32* %.lda, align 4
  -->  %_val_lda_ U: full-set S: full-set
  %_conv = sext i32 %_val_lda_ to i64 
  -->  (sext i32 %_val_lda_ to i64) U: [-2147483648,2147483648) S: [-2147483648,2147483648)
  %_mult_tmp = shl nsw i64 %_conv, 3
  -->  (8 * (sext i32 %_val_lda_ to i64))<nsw> U: [0,-7) S: [-17179869184,17179869177)
  %_sub_tmp4 = sub nuw nsw i64 -8, %_mult_tmp
  -->  (-8 + (-8 * (sext i32 %_val_lda_ to i64))<nsw>)<nsw> U: [0,-7) S: [-17179869184,17179869177)
  %a_ix_dim_0_ = getelementptr i8, i8* %a_rvo_based_addr_, i64 %_ix_x_len11
  -->  {(-8 + %.a),+,(8 * (sext i32 %_val_lda_ to i64))<nsw>}<%_loop_2_do_>


With the xor:

  %_val_lda_ = load i32, i32* %.lda, align 4
  -->  %_val_lda_ U: full-set S: full-set
  %_conv = sext i32 %_val_lda_ to i64 
  -->  (sext i32 %_val_lda_ to i64) U: [-2147483648,2147483648) S: [-2147483648,2147483648)
  %_mult_tmp = shl nsw i64 %_conv, 3
  -->  (8 * (sext i32 %_val_lda_ to i64))<nsw> U: [0,-7) S: [-17179869184,17179869177)
  %_sub_tmp4 = xor i64 %_mult_tmp, -8
  -->  %_sub_tmp4 U: [0,-7) S: [-17179869184,17179869184)
  %a_ix_dim_0_ = getelementptr i8, i8* %a_rvo_based_addr_, i64 %_ix_x_len11
  -->  {((8 * (sext i32 %_val_lda_ to i64))<nsw> + %_sub_tmp4 + %.a),+,(8 * (sext i32 %_val_lda_ to i64))<nsw>}<%_loop_2_do_>

Note that we do not recognize the xor as a subtract of 8 SCEV expression. This prevents further simplifications on expressions using _sub_tmp4, because it is now represented as a symbol rather than a recognized SCEV operation.
Would it be possible to revert this until we teach SCEV how to handle the XOR?
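For reference, a hand-derived sketch of the equivalence SCEV would need to recognize here (reduced from the output above, not part of any patch): every bit that %_mult_tmp can have set is also set in -8, so the subtraction cannot borrow and is bitwise-identical to the xor:

define i64 @xor_form(i64 %conv) {
  %m = shl i64 %conv, 3         ; low three bits are known zero
  %r = xor i64 %m, -8
  ret i64 %r
}

define i64 @sub_form(i64 %conv) {
  %m = shl i64 %conv, 3
  %r = sub i64 -8, %m           ; equal to @xor_form for every %conv
  ret i64 %r
}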

Note that we do not recognize the xor as a subtract of 8 SCEV expression. This prevents further simplifications on expressions using _sub_tmp4, because it is now represented as a symbol rather than a recognized SCEV operation.
Would it be possible to revert this until we teach SCEV how to handle the XOR?

Please correct me if I've misunderstood: would this patch solve the problem that you are seeing? If we want to revert, that would be 79bb915fb60b ?

Note that we do not recognize the xor as a subtract of 8 SCEV expression. This prevents further simplifications on expressions using _sub_tmp4, because it is now represented as a symbol rather than a recognized SCEV operation.
Would it be possible to revert this until we teach SCEV how to handle the XOR?

Please correct me if I've misunderstood: would this patch solve the problem that you are seeing? If we want to revert, that would be 79bb915fb60b ?

Oh sorry! Yes, I meant to post this comment for 79bb915fb60b. And I have double checked that this patch solves the issue. Looking forward to this being committed, if we are not reverting 79bb915fb60b. Thanks.

Note that we do not recognize the xor as a subtract of 8 SCEV expression. This prevents further simplifications on expressions using _sub_tmp4, because it is now represented as a symbol rather than a recognized SCEV operation.
Would it be possible to revert this until we teach SCEV how to handle the XOR?

Please correct me if I've misunderstood: would this patch solve the problem that you are seeing? If we want to revert, that would be 79bb915fb60b ?

Oh sorry! Yes, I meant to post this comment for 79bb915fb60b. And I have double checked that this patch solves the issue. Looking forward to this being committed, if we are not reverting 79bb915fb60b. Thanks.

Thanks for confirming.

Ping @nikic - are you ok with this patch? (and follow-up to remove the fold entirely?)

bjope added a subscriber: uabelho. Jul 22 2022, 10:30 AM

Note that we do not recognize the xor as a subtract of 8 SCEV expression. This prevents further simplifications on expressions using _sub_tmp4, because it is now represented as a symbol rather than a recognized SCEV operation.
Would it be possible to revert this until we teach SCEV how to handle the XOR?

Please correct me if I've misunderstood: would this patch solve the problem that you are seeing? If we want to revert, that would be 79bb915fb60b ?

Oh sorry! Yes, I meant to post this comment for 79bb915fb60b. And I have double checked that this patch solves the issue. Looking forward to this being committed, if we are not reverting 79bb915fb60b. Thanks.

Thanks for confirming.

Ping @nikic - are you ok with this patch? (and follow-up to remove the fold entirely?)

This patch is better than nothing IMHO (the regressions we currently see without either applying this patch or reverting the recent patches involving sub->xor transforms are a bit annoying).
I also think it is a bit weird (or rather, hard to follow) if we canonicalize based only on whether those NSW/NUW attributes are present (at least if we also want to canonicalize xor->sub for the opposite condition). So some kind of follow-up that just keeps (or canonicalizes to) sub up until instruction selection could be nice. If there are motivating examples for using xor instead of sub in the middle end, then we might go in the other direction, provided we first make sure SCEV handles the xor the same way as if it had been a sub.

Reverting rG79bb915fb60b2cd220d89e3bb54f67abb8cdb7bd is also an acceptable solution. AFAICT it has only caused regressions, and I haven't seen (or heard about) any benchmarks being improved by that patch.

I think the right starting point here is to revert rG79bb915fb60b2cd220d89e3bb54f67abb8cdb7bd.

This reminds me of turning 'add' into 'or' with no common bits set (no carries) - that led to a long series of work-arounds to match new patterns with 'or'.
I'm not sure what (if any) SCEV changes were made to accommodate that transform, but it makes sense to me if we want to avoid that with sub->xor.

We have special SCEV handling for this here: https://github.com/llvm/llvm-project/blob/7068aa98412ade19a34b7ed126f4669f581b2311/llvm/lib/Analysis/ScalarEvolution.cpp#L7599. And I believe we have a bunch of haveNoCommonBitsSet() calls in various places to undo the add->or fold. I have some doubts about the value of the add->or canonicalization as well, but the justification is stronger there, and we already have the necessary mitigations in place.

That raises the question: do we want to canonicalize xor to sub?

That would be the principled thing to do, but I'm not sure it would matter much in practice.

spatel abandoned this revision. Jul 22 2022, 1:07 PM

Ok - I reverted the earlier patch that created more xor with:
08091a99ae48

That should solve the immediate regressions that were mentioned here (and we still have the PhaseOrdering test in place, so it will be harder to trip over this in the future).

Note: we still convert a sub from a low-mask constant (e.g. 0x003fff...) into xor (that was the original fold that has been around for many years). I'm not sure if there's something about that pattern that avoids SCEV trouble, or if there's already a work-around for that pattern in SCEV.