This is an archive of the discontinued LLVM Phabricator instance.

Paths

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
test/CodeGen/AArch64/urem-seteq-vec-splat.ll
test/CodeGen/AArch64/urem-seteq.ll
test/CodeGen/X86/jump_sign.ll
test/CodeGen/X86/urem-seteq-vec-splat.ll
test/CodeGen/X86/urem-seteq.ll

Differential D50222

[CodeGen] [SelectionDAG] More efficient code for X % C == 0 (UREM case)
AbandonedPublic

Authored by xbolva00 on Aug 2 2018, 8:58 PM.


Details

Reviewers

MatzeB
kparzysz
efriedma
craig.topper
foad
RKSimon
javed.absar
lebedev.ri
gnudles
hermord

Summary

This implements an optimization described in Hacker's Delight 10-17: when C is constant, the result of X % C == 0 can be computed more cheaply without actually calculating the remainder. The motivation is discussed here: https://bugs.llvm.org/show_bug.cgi?id=35479.
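A minimal sketch of the idea for an odd divisor, assuming 32-bit unsigned arithmetic. The helper names are hypothetical and do not come from the patch. Multiplying by the inverse of C modulo 2^32 permutes the 32-bit values so that exactly the multiples of C land in the range [0, floor((2^32 - 1) / C)], so a multiply and a compare replace the remainder computation.

```cpp
#include <cassert>
#include <cstdint>

// Multiplicative inverse of an odd d modulo 2^32, via Newton's iteration.
// d * d == 1 (mod 8), so the starting guess has 3 correct low bits and each
// step doubles that count: 3 -> 6 -> 12 -> 24 -> 48.
inline uint32_t inverse_mod_2_32(uint32_t d) {
  assert((d & 1u) && "only odd values are invertible modulo 2^32");
  uint32_t inv = d;
  for (int i = 0; i < 4; ++i)
    inv *= 2u - d * inv;
  return inv;
}

// x % d == 0 for an odd d, without computing the remainder.
// For a constant d, both inv and bound would be folded at compile time.
inline bool is_multiple_of_odd(uint32_t x, uint32_t d) {
  const uint32_t inv = inverse_mod_2_32(d);
  const uint32_t bound = UINT32_MAX / d;
  return x * inv <= bound;
}
```

For d = 3 the inverse is 0xAAAAAAAB, so `is_multiple_of_odd(x, 3)` amounts to `x * 0xAAAAAAAB <= 0x55555555`.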

Original patch author (Dmytro Shynkevych) Notes:

In principle, it's possible to also handle the X % C1 == C2 case, as discussed on bugzilla. This seems to require an extra branch on overflow, so I refrained from implementing this for now.
An explicit check for when the REM can be reduced to just its LHS is included: the X % C == 0 optimization breaks test1 in test/CodeGen/X86/jump_sign.ll otherwise. I hadn't managed to find a better way to not generate worse output in this case.
I haven't contributed to LLVM before, so I tried to select reviewers based on who I saw in other reviews. In particular, @kparzysz: a Hexagon test is modified and I have no familiarity with the architecture; hopefully my changes are valid.

Diff Detail

Event Timeline


hermord added inline comments.Aug 2 2018, 9:07 PM

test/CodeGen/Hexagon/swp-const-tc2.ll
9 ↗	(On Diff #158905)	To the best of my understanding, it is valid to not perform the multiplication at all, since its can never be used because of the infinite loop at `b3`. This is the output I get:

// %bb.0:                               // %b0
	{ r0 = ##-1431655765 }
	{ r0 = #0 }
	{ r0 = memw(r0+#0) }
	.p2align	4
.LBB0_1:                                // %b3
                                        // =>This Inner Loop Header: Depth=1
	{ jump .LBB0_1 }

hermord added inline comments.Aug 2 2018, 9:08 PM

test/CodeGen/Hexagon/swp-const-tc2.ll
9 ↗	(On Diff #158905)	*its result can never be used

hermord added a reviewer: foad.Aug 2 2018, 9:12 PM

lebedev.ri added a subscriber: lebedev.ri.Aug 3 2018, 12:07 AM

xbolva00 edited reviewers, added: craig.topper; removed: foad.Aug 3 2018, 3:31 AM

xbolva00 added a subscriber: xbolva00.

xbolva00 added reviewers: foad, RKSimon.Aug 3 2018, 3:36 AM

RKSimon added inline comments.Aug 3 2018, 4:14 AM

include/llvm/CodeGen/TargetLowering.h
3502 ↗	(On Diff #158905)	Please can you not include the Divisor - we are aiming for full vector support for divisions, not just splats - see D50185

kparzysz added inline comments.Aug 3 2018, 12:34 PM

test/CodeGen/Hexagon/swp-const-tc2.ll

9 ↗

(On Diff #158905)

Can you make these changes instead?

-define void @f0() {
+define i32 @f0(i32* %a0) {
 b0:
   br label %b1
 
 b1:                                               ; preds = %b1, %b0
   %v0 = phi i32 [ 0, %b0 ], [ %v9, %b1 ]
   %v1 = phi i32 [ 0, %b0 ], [ %v8, %b1 ]
-  %v2 = load i32, i32* undef, align 4
+  %v2 = load i32, i32* %a0, align 4
   %v3 = add nsw i32 %v1, 1
   %v4 = srem i32 %v2, 3
   %v5 = icmp ne i32 %v4, 0
   %v6 = sub nsw i32 0, %v2
   %v7 = select i1 %v5, i32 %v6, i32 %v2
   %v8 = mul nsw i32 %v3, %v7
   %v9 = add nsw i32 %v0, 1
   %v10 = icmp eq i32 %v9, 1
   br i1 %v10, label %b2, label %b1
 
 b2:                                               ; preds = %b1
   %v11 = phi i32 [ %v8, %b1 ]
   br label %b3
 
 b3:                                               ; preds = %b3, %b2
-  br label %b3
+  ret i32 %v11
 }

Made changes as requested.

majnemer added a subscriber: majnemer.Aug 4 2018, 10:35 PM

majnemer added inline comments.

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3635 ↗	(On Diff #159124)	Is this supposed to be ashr even if we are doing UREM? Seems a little unusual but if this is correct, it probably earns a comment.

@majnemer Thanks; this was, in fact, incorrect. Now, to simplify the logic, the absolute value of D is taken and lshr is used.

In D50222#1188879, @hermord wrote:

@majnemer Thanks; this was, in fact, incorrect. Now, to simplify the logic, the absolute value of D is taken and lshr is used.

There were non-NFC changes to the code, but there were no test changes at all.
I suspect the test coverage is seriously lacking.

majnemer added inline comments.Aug 5 2018, 2:39 PM

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3632–3639 ↗	(On Diff #159227)	The sign bit may be set for APInts fed into URem operations: all isNegative does is check that the MSB is set. I don't think it is OK to branch on the sign bit here without considering whether we are dealing with URem or SRem. I think what you want is: if (IsSigned) { D0 = D.ashr(K); } else { D0 = D.lshr(K); }
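To illustrate why the choice of shift matters, here is a small sketch on plain 32-bit integers. It does not use the APInt API, and the function names are hypothetical. An unsigned divisor whose top bit is set gives a different "odd part" depending on whether the shift is logical or arithmetic.

```cpp
#include <cassert>
#include <cstdint>

// Logical shift: vacated high bits are filled with zeros.
inline uint32_t lshr32(uint32_t d, unsigned k) {
  assert(k < 32);
  return d >> k;
}

// Arithmetic shift: vacated high bits are filled with copies of the MSB.
// Written out by hand so the result does not depend on how the compiler
// treats right shifts of negative signed values.
inline uint32_t ashr32(uint32_t d, unsigned k) {
  assert(k < 32);
  const uint32_t fill = (d >> 31) ? ~(UINT32_MAX >> k) : 0u;
  return (d >> k) | fill;
}
```

With D = 0xFFFFFFFA (4294967290 when read as unsigned) and K = 1, `lshr32` gives 0x7FFFFFFD, the odd part of the unsigned value, while `ashr32` gives 0xFFFFFFFD. The two only agree when the MSB is clear.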

As pointed out by @lebedev.ri, tests were inadequate. Added new tests and stepped through existing ones, adding descriptions. This uncovered two more bugs, which were also fixed here (an extraneous Q.lshrInPlace() and division by D instead of D0). As far as I can see, the coverage now seems reasonable; please point out any cases I missed.

Regarding @majnemer's suggestion, my bad: I should have mentioned in the last update that this algorithm actually assumes that D is positive ("because d is odd and, as we are assuming, positive and not equal to 1..." [Hacker's Delight, p.227]). It probably can be adapted to the D < 0 case with some tweaks, but it seems easier to just take the absolute value of D since this does not change the answer.

nickdesaulniers added a subscriber: nickdesaulniers.Aug 7 2018, 2:10 PM

A few passing-by thoughts.

include/llvm/CodeGen/TargetLowering.h
3502 ↗	(On Diff #159505)	This should be `SmallVectorImpl<SDNode *> &Created` like in all the other cases.
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
3841–3861	I think this is independent from the rest? Can you split it into a separate review?
3863–3866	This feels slightly backwards. Can you do this in `DAGCombiner::visitSETCC()`, and thus avoid all this use-dance?
19851–19853	Would be good to have a test (in a new file) for that.
19853	Please format your changes with clang-format (you can automate that via git pre-commit hook)
19860	SmallVector<SDNode *, 8> Built; like in all the other functions.
19861–19863	Early return please.
  SDValue NewCond = TLI.BuildREMEqFold(REMNode, CondNode, DAG, Built);
  if (!NewCond)
    return SDValue();
19864–19869	Hm. Why is this different from what `DAGCombiner::BuildUDIV()` does? I expected to see just the:
  for (SDNode *N : Built)
    AddToWorklist(N);
  return NewCond;
lib/CodeGen/SelectionDAG/TargetLowering.cpp
3606 ↗	(On Diff #159505)	This should be `SmallVectorImpl<SDNode *> &Created` like in all the other cases.
3631 ↗	(On Diff #159505)	Separate with newline.
3633–3634 ↗	(On Diff #159505)	Add self-explanatory assertion?
3645 ↗	(On Diff #159505)	This is begging to be wrong. Can you create two variables on different lines?
test/CodeGen/Hexagon/swp-const-tc2.ll
36 ↗	(On Diff #159505)	I think the change to this test can land right away separately from the rest of the changes.
test/CodeGen/X86/rem.ll
81 ↗	(On Diff #159505)	Why not instead put all this into a new `test/CodeGen/X86/rem-seteq.ll` file? I'm not sure whether it is worth it splitting the `urem` and `srem` tests into different files though.

Also, really minor, when addressing/fixing an inline comment, could you please actually mark that inline comment as fixed (via that checkbox), else it really isn't clear in which state which comment is.

hermord mentioned this in D50944: [Hexagon] [Test] Remove undef and infinite loop from test.Aug 18 2018, 7:16 PM

Thank you for the review, @lebedev.ri. Sorry for the delay; resolved most of the issues. Quick summary of the changes:

BuildREMEqFold is now invoked from SimplifySetCC, as it logically should be. The signature is changed to mirror the other SimplifySetCC helpers.
Tests are moved to a separate file. The format is slightly changed to get rid of irrelevant selects. Size optimization tests are added. The Hexagon test fix is split into a separate patch: D50944.
The | LHS | < | RHS | optimization (call it (*)) is removed from this patch.

Note: since changes were split into two patches, this now fails tests until D50944 gets merged. Even when that happens, this will still fail test1 in CodeGen/X86/jump_sign because of (*). This failure should probably be fixed by implementing the (*) in a separate patch, but I'm unsure as to the best way to go about it: since BuildREMEqFold moved out of visitREM, it will get invoked before any optimizations in visitREM have had a chance to run, as the SETCC node is processed first. As I see it, there are the following options:

Move BuildREMEqFold back into visitREM. This is ugly for other reasons, as @lebedev.ri pointed out, but it does solve this particular hurdle. (*) can then be added right before it.
Make (*) a separate function and invoke it in both visitREM and SimplifySetCC (before calling BuildREMEqFold). This may result in it running twice, which is probably bad. Computing KnownBits looks like it's expensive.
Check if (*) is possible in SimplifySetCC and, if so, don't touch the node. visitREM then implements the actual optimization.
Ignore (*) and implement an equivalent optimization for multiplication on the output of BuildREMEqFold.
Accept the probably rare regression (dividend must be known to be smaller than the divisor AND the result has to be fed into a comparison with zero) and edit the test.

I lack experience to make the right call here and would much appreciate any suggestions.

hermord marked 4 inline comments as done.Aug 20 2018, 12:12 AM

hermord added inline comments.

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3633–3634 ↗	(On Diff #159505)	Sorry, I couldn't figure out what the assertion should be. The negative case can still be handled just fine; do you mean that I should assert (D0 =/= 1) and then make the caller check for that?

lebedev.ri added a parent revision: D50944: [Hexagon] [Test] Remove undef and infinite loop from test.Aug 20 2018, 12:12 AM

lebedev.ri added inline comments.Aug 20 2018, 12:18 AM

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
3841–3861	You could put this fold into the middle-end, into instsimplify, if we don't do it already. But since it needs `computeKnownBits()`, some measurements would be needed (how many times does it fire on some codebase? possibly, measure compile-time impact)

Some more high-level nits/thoughts:

Other than -Os, do we always want to do this, for every target, with no customization hook?
Right now the tests look good (other than the run-line question), please commit them now.
Similarly, if we want to do it for other arches, also commit a copy of that same rem-seteq.ll test for aarch64 at least.
I think this should be further split into two reviews, unsigned case first :/

include/llvm/CodeGen/TargetLowering.h
3698 ↗	(On Diff #161419)	Separate with newline?
lib/CodeGen/SelectionDAG/TargetLowering.cpp
2780–2782 ↗	(On Diff #161419)	if (SDValue Folded = BuildREMEqFold(VT, N0, N1, Cond, DCI, dl)) return Folded;
3723–3725 ↗	(On Diff #161419)	The produced patterns are rather different between unsigned and signed cases. I think this should be at least done in two steps (two reviews, unsigned first), and likely these should be done by two separate functions.
test/CodeGen/X86/rem-seteq.ll
2 ↗	(On Diff #161419)	Any reason this is targeting `i386` specifically? I think we should test:
; RUN: llc -mtriple=i686-unknown-linux-gnu < %s | FileCheck %s --check-prefixes=CHECK,X86
; RUN: llc -mtriple=x86_64-unknown-linux-gnu < %s | FileCheck %s --check-prefixes=CHECK,X64,NOBMI2
; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+bmi2 < %s | FileCheck %s --check-prefixes=CHECK,X64,BMI2

Thank you for the review, @lebedev.ri, addressed:

- Added isIntDivCheap as an additional condition preventing this optimization. If this isn't customizable enough, we could probably do something like if (isIntDivCheap || (minsize && isIntDivShort)) where isIntDivShort can be overridden per-target.
- I'm not an LLVM developer; as far as I understand, I can't commit anything.
- Added tests for AArch64.
- Removed the signed bits.

CodeGen/X86/jump_sign.ll still isn't passing; I've been trying different things, but none seem to make for a satisfying fix (if I introduce the optimization in InstCombine, I'll still have to change the test to pipe through opt, which I'm not sure is better than just changing the test to match current output). On a related note, after looking at it, should this entire thing be in InstCombine, actually?

Herald added a reviewer: javed.absar. · View Herald TranscriptAug 29 2018, 6:32 PM

hermord marked 3 inline comments as done.Aug 29 2018, 6:35 PM

Diffusion mentioned this in rL341046: [Hexagon][Test] Remove undef and infinite loop from test.Aug 30 2018, 2:34 AM

Baseline tests committed, rebase please.

test/CodeGen/X86/urem-seteq.ll
4	Actually, looking at the check-lines, i'm not sure we want to check the bmi2 version. I'm not seeing anything there that would benefit from anything above baseline, i think.

lebedev.ri added inline comments.Aug 30 2018, 2:39 AM

test/CodeGen/X86/jump_sign.ll
242–244	What do you mean by breaks? Is this a miscompilation?
test/CodeGen/X86/urem-seteq.ll
4	I've meant to drop this check-line, but apparently forgot. I'm not 100% sure if we want/don't want it.

lebedev.ri added a reviewer: lebedev.ri.Aug 30 2018, 3:45 AM

Starting to look much better. Some more nits.

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3727 ↗	(On Diff #163245)	assert((Cond == ISD::SETEQ || Cond == ISD::SETNE) && "Only applicable for [in]equality comparisons.");
3727 ↗	(On Diff #163245)	// fold (seteq/ne (urem N, D), 0) -> (setule/ugt (rotr (mul N, P), K), Q) (it would be great to use more descriptive names to differentiate between `C`onstants and not constants, too)
3749 ↗	(On Diff #163245)	// Decompose D into D0 * 2^K
3750 ↗	(On Diff #163245)	bool DivisorIsEven = K != 0;
3753–3755 ↗	(On Diff #163245)	So what we are really checking here is that `D` is not a power of two. How about simplifying this into:
  APInt D = Divisor->getAPIntValue();
  unsigned W = D.getBitWidth();
  // If D is power of two, D0 would be 1, and we cannot build this fold.
  if(D.isPowerOf2())
    return SDValue();
  // Decompose D into D0 * 2^K
  ...
BUT. I do not understand where this requirement comes from. Can you quote the specific part of [[ https://doc.lagout.org/security/Hackers%20Delight.pdf | `10–17 Test for Zero Remainder after Division by a Constant` ]] that warrants this?
3757–3761 ↗	(On Diff #163245)	Would [[ http://llvm.org/doxygen/classllvm_1_1APInt.html#a28ee15b0286415cce5599d4fb9f9ce02 | `APInt::multiplicativeInverse()` ]] work here?
3764–3765 ↗	(On Diff #163245)	I would think just writing this as one line would be as clean: APInt Q = APInt::getAllOnesValue(W).udiv(D0);
3771 ↗	(On Diff #163245)	I do not know if this is needed? At least rL338079 removed such a call.
3776 ↗	(On Diff #163245)	if(DivisorIsEven) {
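The fold named in the comments above, (seteq (urem N, D), 0) -> (setule (rotr (mul N, P), K), Q), can be sketched at 16 bits so that it is cheap to compare against `%` for every input. This is only an illustration: the function names are hypothetical, and the bound Q is taken here as floor((2^16 - 1) / D), which is an assumption of this sketch and not a transcription of the patch.

```cpp
#include <cassert>
#include <cstdint>

// Rotate a 16-bit value right by k bits.
inline uint16_t rotr16(uint16_t v, unsigned k) {
  k &= 15u;
  return static_cast<uint16_t>((v >> k) | (v << ((16u - k) & 15u)));
}

// Inverse of an odd d0 modulo 2^16 (Newton's iteration: 3 -> 6 -> 12 -> 24 bits).
inline uint16_t inverse_mod_2_16(uint16_t d0) {
  assert((d0 & 1u) && "only odd values are invertible");
  uint32_t inv = d0;
  for (int i = 0; i < 3; ++i)
    inv *= 2u - d0 * inv;
  return static_cast<uint16_t>(inv);
}

// (x % d == 0) rewritten as rotr(x * P, K) <= Q.
inline bool urem_is_zero(uint16_t x, uint16_t d) {
  assert(d != 0);
  unsigned k = 0;
  while (((d >> k) & 1u) == 0)
    ++k;                                      // d == d0 * 2^k with d0 odd
  const uint16_t d0 = static_cast<uint16_t>(d >> k);
  const uint16_t p = inverse_mod_2_16(d0);    // P
  const uint16_t q = UINT16_MAX / d;          // Q (assumed bound, see lead-in)
  // Widen before multiplying so the product cannot overflow a signed int.
  const uint16_t prod = static_cast<uint16_t>(static_cast<uint32_t>(x) * p);
  return rotr16(prod, k) <= q;
}

// Compares the fold against the plain remainder for every 16-bit x.
inline bool fold_matches_urem(uint16_t d) {
  for (uint32_t x = 0; x <= UINT16_MAX; ++x)
    if (urem_is_zero(static_cast<uint16_t>(x), d) != (x % d == 0))
      return false;
  return true;
}
```

Calling `fold_matches_urem(6)` walks all 65536 inputs, which is a quick way to sanity-check a choice of P, K and Q before reasoning about wider types.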

lebedev.ri added inline comments.Aug 30 2018, 6:06 AM

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3727 ↗	(On Diff #163245)	(it would be great to use more descriptive names to differentiate between Constants and not constants, too) But thinking about it a bit more, probably don't rename the variables. Consistency [with the orig doc] may be best.

lebedev.ri added inline comments.Aug 30 2018, 6:55 AM

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3746–3755 ↗	(On Diff #163245)	Well, i see it now - https://rise4fun.com/Alive/aXUu. Then this should be:
  APInt D = Divisor->getAPIntValue();
  if(D.isPowerOf2()) {
    // rem by power-of-two is better represented by and-mask.
    return SDValue();
  }
  // Decompose D into D0 * 2^K
  ...
3764 ↗	(On Diff #163245)	Also, move the `W` here, to the use.

Comments addressed. The minsize condition needs some tweaking, it seems: the code with it works out to actually be longer on X86. Perhaps there should really be something like isIntDivShort.

Re: test1, by it being broken I only mean that it doesn't pass: it's not a miscompilation, just a less optimal output than desired.

hermord marked 10 inline comments as done.Aug 30 2018, 9:25 PM

hermord added inline comments.

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3753–3755 ↗	(On Diff #163245)	The algorithm also fails in general in this case because the inverse of 1 is 1 itself, and we get, for example:
  Given: N: 1, D: 4, C: 0
  Then:
    K = 2
    D0 = 1
    P = 1
    Q = (2^32 - 1)/1 = 2^32 - 1
  And so: (P >>rot 2) == 2^30 < 2^32 - 1 == Q, so the condition holds, but N % D != 0
The book mentions this in passing: "This can be simplified a little by observing that because d is odd and, as we are assuming, positive and not equal to 1 [...]" (p. 227).
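The counterexample above can be replayed with fixed-width arithmetic. The sketch below only re-evaluates the check with the constants quoted in the comment (P = 1, K = 2, Q = 2^32 - 1); `rotr32` and `fold_check` are hypothetical names.

```cpp
#include <cstdint>

// Rotate a 32-bit value right by k bits.
inline uint32_t rotr32(uint32_t v, unsigned k) {
  k &= 31u;
  return (v >> k) | (v << ((32u - k) & 31u));
}

// The check as parameterised in the comment: rotr(N * P, K) <= Q.
inline bool fold_check(uint32_t n, uint32_t p, unsigned k, uint32_t q) {
  return rotr32(n * p, k) <= q;
}
```

`fold_check(1, 1, 2, UINT32_MAX)` evaluates `rotr32(1, 2)`, which is 2^30, and reports true even though 1 % 4 is 1. With Q equal to the largest 32-bit value the comparison can never fail, which is why D0 = 1 has to be excluded under these constants.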
test/CodeGen/X86/jump_sign.ll
242–244	Oh my bad, to clarify: I was referring to `test1` in this file and not the one that's changed here. This one is fine, but the code produced for `test1` after this patch would be less optimal than before. The output of `BuildUREMEqFold` just doesn't seem to fall into patterns that we're already optimizing well in this particular case (`N` is extended from `i1`, then `AND`'ed and then passed into an UREM), and we'd probably have to do a new `KnownBits`-based optimization elsewhere to avoid this. Via `InstCombine` this can actually be fixed with close to no extra overhead because we are already computing the necessary bits anyway.

Updated AArch64 tests.

This looks about right as far as i can tell.
But would be best if others could review, too.

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3727 ↗	(On Diff #163245)	The comment change wasn't applied, i'm not aware of `gte` predicate, but i do know `[u/s]ge`. Also, reading the actual code, `ugt` was meant it seems.
3772 ↗	(On Diff #163480)	You can move this closer to where it's actually used.
3789 ↗	(On Diff #163480)	You don't actually modify `D` as far as i can tell? I think it should be `const APInt &D`
3799 ↗	(On Diff #163480)	Ok, thank you for the explanation regarding `D0 != 1`. I think here now we can just add an assert, to document it.
  APInt D0 = D.lshr(K);
  assert(!D0.isOneValue() && "The fold is invalid for D0 of 1. Unreachable since we leave powers-of-two to be improved elsewhere.");
3814 ↗	(On Diff #163480)	Drop the comment about signed cases? I strongly believe that should be done in a new function.
3824 ↗	(On Diff #163480)	Drop the comment about signed cases? I strongly believe that should be done in a new function.

Please add uniform and non-uniform vector test cases as well.

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3782 ↗	(On Diff #163480)	It should be trivial for you to add non-uniform vector support in this patch, following the patterns I did for the other division-by-constant builders.

And according to GCC mailing list (https://gcc.gnu.org/ml/gcc-patches/2018-09/msg00193.html) and their stats, unsigned case of X % C1 == C2 is still worth to handle as well.

In D50222#1221943, @RKSimon wrote:

Please add uniform and non-uniform vector test cases as well.

In D50222#1223180, @xbolva00 wrote:

And according to GCC mailing list (https://gcc.gnu.org/ml/gcc-patches/2018-09/msg00193.html) and their stats, unsigned case of X % C1 == C2 is still worth to handle as well.

@hermord to reiterate, as far as i'm concerned, this only needs vector tests.
Everything else should go into new reviews (srem, urem nonsplat, srem nonsplat; non-zero constants (another 4 reviews ideally?))
As for the vector tests, just add them here, i will precommit.
I think, you want to operate on <4 x i32>, with the run line

; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+sse2 < %s | FileCheck %s --check-prefixes=CHECK,CHECK-SSE2

and

; RUN: llc -mtriple=aarch64-unknown-linux-gnu < %s | FileCheck %s

Just completely duplicate the existing test files with -vector suffix. I.e. if you have

define i32 @test_urem_odd(i32 %X) nounwind readnone {
  %urem = urem i32 %X, 5
  %cmp = icmp eq i32 %urem, 0
  %ret = zext i1 %cmp to i32
  ret i32 %ret
}

It will be

define <4 x i32> @test_urem_odd_vec(<4 x i32> %X) nounwind readnone {
  %urem = urem <4 x i32> %X, <i32 5, i32 5, i32 5, i32 5>
  %cmp = icmp eq <4 x i32> %urem, <i32 0, i32 0, i32 0, i32 0>
  %ret = zext <4 x i1> %cmp to <4 x i32>
  ret <4 x i32> %ret
}

I'm not quite sure about nonsplat tests. things to keep in mind there:

You have two constants, so add three positive tests with one constant element being undef (i.e. first test with undef in the first constant, second test with undef in the second constant, and a test with undef in both)
The codepath is different depending on whether the divisor is even. So you want to add a test where half of the divisor is even, and half is odd
???

hermord updated this revision to Diff 164794.Sep 10 2018, 9:34 PM

Apologies again for the delay.

Made cosmetic changes as directed by @lebedev.ri and added vector tests.
Disabled the fold for vector divisors with even values (see inline comment and test_urem_even_vec_i16).

Nonsplat divisors, nonzero C2 and possibly target customization (something like bool shouldBuildUREMEqFold(SDValue Divisor, SDValue CompTarget, bool isSREM, const DAGCombinerInfo &DCI)?) will be in separate patches.

The splat tests are only partly the same as the scalar ones because bit30/bit31 is not handled any differently in vectors and so is redundant; on the other hand, testing the <4 x i16> case turned out to be valuable (identified the ROTR-induced crash described in inline comment).

hermord marked 5 inline comments as done.Sep 10 2018, 9:49 PM

hermord added inline comments.

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3782 ↗	(On Diff #163480)	I've got conflicting directions for this patch (add nonsplat here or submit it separately), so I didn't touch that part in this update.
3799 ↗	(On Diff #163480)	This turns out to actually be reachable: the logic that handles REM by powers of two is in `visitREM`, which is invoked after `visitSetCC` from where we would've come here. I kept this as an early return if that's OK.
test/CodeGen/X86/jump_sign.ll
410	This is the `test1` regression I mentioned in previous updates. I've made it explicit now, for the lack of a clean fix that I can see (without `computeKnownBits`).
test/CodeGen/X86/urem-seteq.ll
4	Should I drop it? On a related note, I could add `AVX2` to the vector tests on `X86` if that's likely to be useful.

Diffusion mentioned this in rL341943: [Hexagon] [Test] Remove undef and infinite loop from test.Sep 11 2018, 7:09 AM

RKSimon added inline comments.Sep 11 2018, 8:35 AM

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3812 ↗	(On Diff #164794)	APInt::lshr returns a value not a reference.
3831 ↗	(On Diff #164794)	APInt::zext returns a value not a reference.
3836 ↗	(On Diff #164794)	Again, value not a reference
3843 ↗	(On Diff #164794)	You must ensure that MUL + ROTR are legal when necessary - pass a IsAfterLegalization flag into the function and check with isOperationLegalOrCustom/isOperationLegal - see TargetLowering::BuildSDIV
test/CodeGen/X86/urem-seteq-vec-nonsplat.ll
2 ↗	(On Diff #164794)	Run with avx2 as well for better test coverage.
6 ↗	(On Diff #164794)	You don't need nonsplat in the test name - it's in the filename.
test/CodeGen/X86/urem-seteq-vec-splat.ll
2	Run with avx2 as well for better test coverage.
27	Use legal types - 8 x i16 etc.

Updated the tests in rL341953.
As far as i'm concerned this is good to go.
Vector and vector non-splat support sounds like a great task for the first follow-up :)
But, still let's wait for others to comment for a bit..

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3799 ↗	(On Diff #163480)	Sounds good.
test/CodeGen/X86/urem-seteq.ll
4	Right, good idea. I checked all the various sse/avx versions, and kept the unique ones that make a difference here.

This revision is now accepted and ready to land.Sep 11 2018, 8:38 AM

Thank you for the review, @RKSimon. Made changes as directed.

hermord marked 8 inline comments as done.Sep 11 2018, 4:09 PM

hermord marked an inline comment as done.

In D50222#1231315, @hermord wrote:

Thank you for the review, @RKSimon. Made changes as directed.

Can you please rebase your differential, so the diff is as compared to the svn?
(i have previously committed the vector tests, but they will need to be recommitted now).

Rebased.

Seems fine for trunk now..

RKSimon requested changes to this revision.Sep 12 2018, 1:41 PM

RKSimon added inline comments.

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3606 ↗	(On Diff #159505)	Isn't this still missing?
test/CodeGen/X86/urem-seteq-vec-nonsplat.ll
6 ↗	(On Diff #165115)	Please add these test cases back.
test/CodeGen/X86/urem-seteq-vec-splat.ll
6	Please add these test cases back.

This revision now requires changes to proceed.Sep 12 2018, 1:41 PM

Re-added previously committed tests.

hermord marked 2 inline comments as done.Sep 13 2018, 9:25 PM

hermord added inline comments.

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3606 ↗	(On Diff #159505)	The approach is different now: `BuildUREMEqFold` is now called from `TargetLowering::SimplifySetCC` and not from `DAGCombiner::visitREM`, and I tried to match the style of the other helpers called from there (like `simplifySetCCWithAnd`).

@RKSimon should I make any other changes to this?

Hm, i stalled the review here it seems, sorry.
See inline comment.

I think you only updated the splat vector tests to use legal vectors (<8 x i16> instead of <4 x i16>),
but did not update the non-splat test files?
Can you update them too please? I'll then re-commit the tests once more, hopefully the last time :)

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3606 ↗	(On Diff #159505)	Looking at the code on which this was originally based, i think i was slightly wrong. This should still have a pointer to smallvector, and it should push back the newly created `SDValue`s. And this function should be called from a new helper function that would make proper use of that. See `TargetLowering::BuildSDIVPow2()` (which is the implementation), vs `DAGCombiner::BuildSDIVPow2()` (which is the wrapper).

Hm, i stalled the review here it seems, sorry.
See inline comment.

@lebedev.ri Sorry, couldn't find any instances of 4 x i16 in the nonsplat tests; I think I only had those in the splat ones. The patch no longer applied cleanly, so I rebased it again.

hermord added inline comments.Sep 25 2018, 6:38 AM

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3606 ↗	(On Diff #159505)	Ah sorry, just to clarify, there used to be a `DAGCombiner::BuildREMEqFold` in an earlier version of this patch, but my rationale for removing it was that:
- We wanted to generate this fold when visiting the SETCC and not the UREM.
- `DAGCombiner::visitSetCC` delegates to `DAGCombiner::SimplifySetCC`, which is just a stub calling `TargetLowering::SimplifySetCC`.
- `TargetLowering::SimplifySetCC` adds nodes to the worklist directly and doesn't return an `SDValue` vector.
If it's better to mimic `BuildSDIVPow2` and have a wrapper, I could:
- Have `TargetLowering::SimplifySetCC` call the wrapper from `DAGCombiner`, at the risk of this being slightly confusing (I don't believe anything else in that function does that?)
- Remove the call to `TargetLowering::BuildUREMEqFold` from `TargetLowering::SimplifySetCC` and add a call to the wrapper instead to `DAGCombiner::SimplifySetCC` or `DAGCombiner::visitSETCC`. Currently neither actually contains any case-specific simplification logic, so I'm not sure if I should change that.

In D50222#1244716, @hermord wrote:

@lebedev.ri Sorry, couldn't find any instances of 4 x i16 in the nonsplat tests; I think I only had those in the splat ones.

Hm, could be. I don't see them either now.

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3606 ↗	(On Diff #159505)	This current `BuildUREMEqFold()` does not `adds nodes to the worklist directly` as far as i can see? Well, i had the third variant in mind.
- Rename this function to something like `BuildUREMEqFold_REAL()` (bad name, just an example)
- Add a new wrapper `BuildUREMEqFold()` here in this file, right after `BuildUREMEqFold_REAL()`. So nothing changes for the callers. And the wrapper will do all that stuff about receiving the vector of newly created `SDNodes`, and adding them to the worklist.

Made a wrapper as discussed.

hermord added inline comments.Sep 25 2018, 4:52 PM

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3606 ↗	(On Diff #159505)	It doesn't right now, this is true. I removed it following this comment:

Comment at: lib/CodeGen/SelectionDAG/TargetLowering.cpp:3771
+ SDValue Op1 = DAG.getNode(ISD::MUL, DL, REMVT, REMNode->getOperand(0), PVal);
+ DCI.AddToWorklist(Op1.getNode());
+ // Will change in the signed case
----------------
I do not know if this is needed? At least rL338079 removed such a call.

Looking back at rL338079, it doesn't say that `AddToWorklist` is never necessary in this context, and similar code (`BuildUDIV`) does do it, so removing it might have been premature. I took a stab at the third variant in the latest diff.

Any further blockers here?

RKSimon added inline comments.Oct 2 2018, 9:24 AM

test/CodeGen/AArch64/urem-seteq-vec-nonsplat.ll
6 ↗	(On Diff #167025)	You can commit the nonsplat test name changes as an NFC now to reduce this patch.
test/CodeGen/X86/jump_sign.ll
410	Have you made any progress working out what the problem is with test1?
test/CodeGen/X86/urem-seteq-vec-nonsplat.ll
10 ↗	(On Diff #167025)	You can commit the nonsplat test name changes as an NFC now to reduce this patch.
test/CodeGen/X86/urem-seteq-vec-splat.ll
65	These <4 x i16> -> <8 x i16> test changes need to be done as an NFC commit, showing the current codegen and then this patch rebased. It's up to you if you keep the aarch64 using <4 x i16> or not but the x86 versions need to be changed to a legal type.

RKSimon mentioned this in rL343934: [AARCH64][X86] Remove _nonsplat from test names.Oct 7 2018, 4:25 AM

xbolva00 removed a reviewer: xbolva00.Oct 15 2018, 2:08 PM

xbolva00 removed a subscriber: xbolva00.

@hermord Are you still looking at this?

Btw, why don't we do this transformation at the IR level first?

spatel added a subscriber: spatel.Nov 24 2018, 7:43 AM

nikic added a subscriber: nikic.Nov 24 2018, 3:03 PM

@hermord Do you mind if I commandeer this to get it finished please?

@RKSimon I am sorry for the long radio silence. It is entirely my fault.

Regarding test1, the problem seems to be that once this fold is applied, producing something of the form (setule (mul t, -85), 85), it is difficult to deduce that this is the same as (seteq t, 0) in this case. In what was produced before, (seteq (sub t, (mul ...)), 0), a series of optimizations could figure out that the mul node is always zero.
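The equivalence described above can be checked exhaustively under one assumption that the thread does not state: that the comparison is 8 bits wide. That reading is inferred from the constants, since -85 is 171 modulo 256, the inverse of 3, and 85 is floor(255 / 3). The function names are hypothetical.

```cpp
#include <cstdint>

// (setule (mul t, -85), 85) evaluated in 8-bit arithmetic.
// 171 is -85 modulo 256, and 3 * 171 == 513 == 1 (mod 256).
inline bool folded(uint8_t t) {
  return static_cast<uint8_t>(t * 171u) <= 85u;
}

// Over the whole 8-bit range the folded form is the test t % 3 == 0.
inline bool folded_matches_urem3() {
  for (unsigned t = 0; t < 256; ++t)
    if (folded(static_cast<uint8_t>(t)) != (t % 3 == 0))
      return false;
  return true;
}

// For t known to be below `limit`, does the folded form reduce to t == 0?
inline bool folded_is_eq_zero_below(unsigned limit) {
  for (unsigned t = 0; t < limit && t < 256; ++t)
    if (folded(static_cast<uint8_t>(t)) != (t == 0))
      return false;
  return true;
}
```

The folded form agrees with `t == 0` only while t is known to be smaller than 3. The arithmetic alone does not show that, so a combine would need the known-bits information to recover `(seteq t, 0)`.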

This can be worked around in InstCombine without any additional KnownBits computations beyond what already happens. I'd looked into implementing a fix there before (mentioned in discussion with @lebedev.ri) and could replicate it, but this will require running opt on jump_sign.ll to restore previous behavior.

I do not have commit rights, so I cannot enact the suggestion to commit the tests as NFC, but I can implement the above (or any other) approach for test1 now, if required, to finalize the patch. I feel like this is very close to completion, but if this is not acceptable, please feel free to commandeer the patch.

Ok, so create new revision with tests? Somebody will commit it for you.

hermord mentioned this in D56372: [NFC] Make vector types legal in UREM test.Jan 6 2019, 5:41 PM

Rebased on top of D56372.

RKSimon mentioned this in rL353004: [NFC] Make vector types legal in UREM test.Feb 3 2019, 11:38 AM

RKSimon mentioned this in rG135413d38156: [NFC] Make vector types legal in UREM test.

@hermord reverse ping

Herald added a project: Restricted Project. · View Herald TranscriptFeb 7 2019, 4:26 AM

Hi guys, I found a formula for unsigned integers that also works with even divisors and with any remainder, without the need to check for overflow.
From the divisor d and the remainder r, I calculate four constants.
Any d != 0 should work.

void calculate_constants64(uint64_t d, uint64_t r, uint64_t &k,uint64_t &mmi, uint64_t &s,uint64_t& u)
{
	k=__builtin_ctzll(d);/* the power of 2 */
	uint64_t d_odd=d>>k;
	mmi=find_mod_mul_inverse(d_odd,64);
	/* 64 is word size*/
	s=(r*mmi);
	u=(ULLONG_MAX-r)/d;/* note that I divide by d, not d_odd */
}

A little bit of background: the constant (u + 1) is the number of values in the 64-bit range that yield the requested remainder.
Subtracting the constant s zeroes the low k bits when the given value has remainder r; it also cancels the remainder modulo d_odd.

The compiler should then generate the following code with these constants:

int checkrem64(uint64_t k,uint64_t mmi, uint64_t s,uint64_t u,uint64_t x)
{
    uint64_t o=((x*mmi)-s);
    o= (o>>k)|(o<<((64-k)&63));/*ROTR64(o,k); the &63 avoids an undefined shift by 64 when k==0*/
    return o<=u;
}

This replaces the following:

/* d is the divisor, r is the remainder */
int checkrem64(uint64_t x)
{
  return x%d==r;
}

This is the code to find the modular inverse:

uint64_t find_mod_mul_inverse(uint64_t x, uint64_t bits)
{
      if (bits > 64 || ((x&1)==0))
              return 0;// invalid parameters
      uint64_t mask;
      if (bits == 64)
              mask = -1;
      else
      {                
              mask = 1;
              mask<<=bits;
              mask--;
      }
      x&=mask;
      uint64_t result=1, state=x, ctz=0;
      while(state!=1ULL)
      {
              ctz=__builtin_ctzll(state^1);
              result|=1ULL<<ctz;
              state+=x<<ctz;
              state&=mask;
      }
      return result;
}

good luck!
I tested this on all the cases of 10bit word size, and it passed.
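The scheme above can be checked exhaustively at a small word size. The following is a minimal sketch, not the code from this review: the names `lowMask`, `inverseOdd`, `rotr`, `foldedRemEquals` and `foldMatchesEverywhere` are made up for illustration, the word size is a parameter so that narrow widths can be brute-forced, and the inverse is computed by Newton iteration instead of the bit-by-bit loop shown earlier.

```cpp
#include <cassert>
#include <cstdint>

// Mask with the low `bits` bits set (1 <= bits <= 64).
std::uint64_t lowMask(unsigned bits) {
  return bits == 64 ? ~0ULL : ((1ULL << bits) - 1);
}

// Multiplicative inverse of odd x modulo 2^bits, by Newton iteration.
std::uint64_t inverseOdd(std::uint64_t x, unsigned bits) {
  assert((x & 1) && "only odd values are invertible modulo a power of two");
  std::uint64_t inv = x;        // correct to 3 bits: x * x == 1 (mod 8)
  for (int i = 0; i < 5; ++i)   // 3 -> 6 -> 12 -> 24 -> 48 -> 96 bits
    inv *= 2 - x * inv;
  return inv & lowMask(bits);
}

// Rotate the low `bits` bits of v right by k (0 <= k < bits).
std::uint64_t rotr(std::uint64_t v, unsigned k, unsigned bits) {
  if (k == 0)
    return v;
  return ((v >> k) | (v << (bits - k))) & lowMask(bits);
}

// Folded test for x % d == r on `bits`-bit unsigned values.
bool foldedRemEquals(std::uint64_t x, std::uint64_t d, std::uint64_t r,
                     unsigned bits) {
  const std::uint64_t mask = lowMask(bits);
  assert(d != 0 && d <= mask && r < d);
  unsigned k = 0;
  while (((d >> k) & 1) == 0)
    ++k;                                          // d = dOdd * 2^k
  const std::uint64_t mmi = inverseOdd(d >> k, bits);
  const std::uint64_t s = (r * mmi) & mask;
  const std::uint64_t u = (mask - r) / d;         // divide by d, not dOdd
  const std::uint64_t o = (x * mmi - s) & mask;
  return rotr(o, k, bits) <= u;
}

// Compare against x % d == r for every x, d and r at this width (bits < 64).
bool foldMatchesEverywhere(unsigned bits) {
  assert(bits < 64);
  const std::uint64_t limit = 1ULL << bits;
  for (std::uint64_t d = 1; d < limit; ++d)
    for (std::uint64_t r = 0; r < d; ++r)
      for (std::uint64_t x = 0; x < limit; ++x)
        if (foldedRemEquals(x, d, r, bits) != (x % d == r))
          return false;
  return true;
}
```

For example, `foldMatchesEverywhere(8)` compares the folded test against `x % d == r` for every 8-bit x, d and r; wider widths take correspondingly longer.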

*Edit:* I looked for something that would work for signed integers. I came up with something that would work with negative numbers if the following assumption were correct:

(-21)%10==9

But this assumption is not correct, because (-21)%10 equals -1.
Anyway, the idea is this: you shift the range and change s and u accordingly:

void calculate_constants64(uint64_t d, uint64_t r, uint64_t &k,uint64_t &mmi, uint64_t &s,uint64_t& u)
{
	k=__builtin_ctzll(d);/* the power of 2 */
	uint64_t d_odd=d>>k;
	mmi=find_mod_mul_inverse(d_odd,64);
	/* 64 is word size*/
     //this is the added line to make it work with signed integers
      r+=0x8000000000000000ULL% d;
	s=(r*mmi);
	u=(ULLONG_MAX-r)/d;
}

int checkrem64(uint64_t k,uint64_t mmi, uint64_t s,uint64_t u,uint64_t x)
{
     //this is the added line to make it work with signed integers
//x came in as a signed number but was cast to unsigned
    x^=0x8000000000000000ULL;// this is an addition of 2^63, simplified to xor
    uint64_t o=((x*mmi)-s);
    o= (o>>k)|(o<<((64-k)&63));/*ROTR64(o,k); the &63 avoids an undefined shift by 64 when k==0*/
    return o<=u;
}

But there must be a way to tweak u and s to make it work on negative-only or positive-only numbers. It should be easy.
Can someone please contact me to do the math?
I need some background and an explanation of how the folding of unsigned numbers should behave, and then I will try to find the formula for you: dvoreader@gmail.com
@jdoerfert, please read this carefully, because you did not calculate the Q constant correctly when D is even. That is why it did not work for you when D was 4, and it will also not work when D is 6, 10, etc.
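The point about Q for even divisors can be illustrated with a small brute-force sketch. It assumes the 16-bit, D = 14 case (D0 = 7, K = 1, P = 28087) that also appears in the tests of this diff, and compares the bound floor((2^16 - 1) / D0) = 9362 with floor((2^16 - 1) / D) = 4681. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Rotate a 16-bit value right by k (0 < k < 16).
std::uint16_t rotr16(std::uint16_t v, unsigned k) {
  unsigned w = v;
  return static_cast<std::uint16_t>((w >> k) | (w << (16 - k)));
}

// Folded x % 14 == 0 for 16-bit x, with the comparison bound q as a parameter.
bool foldedDivisibleBy14(std::uint16_t x, std::uint16_t q) {
  // 28087 is the inverse of 7 modulo 2^16 (7 * 28087 == 3 * 65536 + 1).
  std::uint16_t prod = static_cast<std::uint16_t>(x * 28087u);
  return rotr16(prod, 1) <= q;
}

// Counts the 16-bit inputs where the folded test disagrees with x % 14 == 0.
unsigned countMismatches(std::uint16_t q) {
  unsigned bad = 0;
  for (unsigned x = 0; x <= 0xFFFF; ++x)
    if (foldedDivisibleBy14(static_cast<std::uint16_t>(x), q) != (x % 14 == 0))
      ++bad;
  return bad;
}

// Aborts unless the bound derived from D itself is exact for every input.
void verifyBoundFromD() { assert(countMismatches(4681) == 0); }
```

With the bound derived from D0, x = 12 is already misclassified: 12 * 28087 mod 2^16 is 9364, rotating right by one gives 4682, and 4682 <= 9362 holds even though 12 is not a multiple of 14. With the bound derived from D the two tests agree on every input.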

This revision now requires changes to proceed.Feb 23 2019, 11:03 AM

Herald added a subscriber: jdoerfert.Feb 23 2019, 11:03 AM

Rebased. Moved to DAGCombiner and fixed regression in jump_sign.ll.

xbolva00 edited the summary of this revision.Jun 14 2019, 3:00 PM

Oh, no. Still there :((

xbolva00 abandoned this revision.Jun 14 2019, 3:27 PM

Hmm, i don't see the button to take the review,
so i had to post a new one: D63391
The test/CodeGen/X86/jump_sign.ll regression is rather trivial, and is being fixed by a followup patch D63390.

lebedev.ri mentioned this in rGcdd43eac4fe3: [Codegen] TargetLowering::SimplifySetCC(): omit urem when possible.Jun 25 2019, 3:05 AM

Diffusion mentioned this in rL364286: [Codegen] TargetLowering::SimplifySetCC(): omit urem when possible.Jun 25 2019, 3:07 AM

Diffusion mentioned this in rL364563: [CodeGen] [SelectionDAG] More efficient code for X % C == 0 (UREM case) (try 2).Jun 27 2019, 9:45 AM

lebedev.ri mentioned this in rG0627b09863b8: [CodeGen] [SelectionDAG] More efficient code for X % C == 0 (UREM case) (try 2).Jun 27 2019, 9:47 AM

Diffusion mentioned this in rL364600: [CodeGen] [SelectionDAG] More efficient code for X % C == 0 (UREM case) (try 3).Jun 27 2019, 2:52 PM

lebedev.ri mentioned this in rG29d05c005fa8: [CodeGen] [SelectionDAG] More efficient code for X % C == 0 (UREM case) (try 3).Jun 27 2019, 2:53 PM

Revision Contents

Path	Size
lib/CodeGen/SelectionDAG/DAGCombiner.cpp	136 lines
test/CodeGen/AArch64/urem-seteq-vec-splat.ll	20 lines
test/CodeGen/AArch64/urem-seteq.ll	82 lines
test/CodeGen/X86/jump_sign.ll	17 lines
test/CodeGen/X86/urem-seteq-vec-splat.ll	123 lines
test/CodeGen/X86/urem-seteq.ll	141 lines

Diff 204851

lib/CodeGen/SelectionDAG/DAGCombiner.cpp


Show First 20 Lines • Show All 481 Lines • ▼ Show 20 Lines	SDValue convertSelectOfFPConstantsToLoadOffset(
const SDLoc &DL, SDValue N0, SDValue N1, SDValue N2, SDValue N3,		const SDLoc &DL, SDValue N0, SDValue N1, SDValue N2, SDValue N3,
ISD::CondCode CC);		ISD::CondCode CC);
SDValue foldSelectCCToShiftAnd(const SDLoc &DL, SDValue N0, SDValue N1,		SDValue foldSelectCCToShiftAnd(const SDLoc &DL, SDValue N0, SDValue N1,
SDValue N2, SDValue N3, ISD::CondCode CC);		SDValue N2, SDValue N3, ISD::CondCode CC);
SDValue foldLogicOfSetCCs(bool IsAnd, SDValue N0, SDValue N1,		SDValue foldLogicOfSetCCs(bool IsAnd, SDValue N0, SDValue N1,
const SDLoc &DL);		const SDLoc &DL);
SDValue unfoldMaskedMerge(SDNode *N);		SDValue unfoldMaskedMerge(SDNode *N);
SDValue unfoldExtremeBitClearingToShifts(SDNode *N);		SDValue unfoldExtremeBitClearingToShifts(SDNode *N);
		SDValue simplifyREM(SDNode *N);
		SDValue simplifyUREMWithConstant(EVT VT, SDValue REMNode, SDValue CompNode,
		ISD::CondCode Cond, SelectionDAG &DAG,
		const SDLoc &DL);
SDValue SimplifySetCC(EVT VT, SDValue N0, SDValue N1, ISD::CondCode Cond,		SDValue SimplifySetCC(EVT VT, SDValue N0, SDValue N1, ISD::CondCode Cond,
const SDLoc &DL, bool foldBooleans);		const SDLoc &DL, bool foldBooleans);
SDValue rebuildSetCC(SDValue N);		SDValue rebuildSetCC(SDValue N);

bool isSetCCEquivalent(SDValue N, SDValue &LHS, SDValue &RHS,		bool isSetCCEquivalent(SDValue N, SDValue &LHS, SDValue &RHS,
SDValue &CC) const;		SDValue &CC) const;
bool isOneUseSetCC(SDValue N) const;		bool isOneUseSetCC(SDValue N) const;

▲ Show 20 Lines • Show All 3,282 Lines • ▼ Show 20 Lines	SDValue DAGCombiner::visitUDIVLike(SDValue N0, SDValue N1, SDNode *N) {
if (isConstantOrConstantVector(N1) &&		if (isConstantOrConstantVector(N1) &&
!TLI.isIntDivCheap(N->getValueType(0), Attr))		!TLI.isIntDivCheap(N->getValueType(0), Attr))
if (SDValue Op = BuildUDIV(N))		if (SDValue Op = BuildUDIV(N))
return Op;		return Op;

return SDValue();		return SDValue();
}		}

// handles ISD::SREM and ISD::UREM		SDValue DAGCombiner::simplifyREM(SDNode *N) {
SDValue DAGCombiner::visitREM(SDNode *N) {
unsigned Opcode = N->getOpcode();		unsigned Opcode = N->getOpcode();
SDValue N0 = N->getOperand(0);		SDValue N0 = N->getOperand(0);
SDValue N1 = N->getOperand(1);		SDValue N1 = N->getOperand(1);
EVT VT = N->getValueType(0);		EVT VT = N->getValueType(0);
EVT CCVT = getSetCCResultType(VT);		EVT CCVT = getSetCCResultType(VT);

bool isSigned = (Opcode == ISD::SREM);		bool isSigned = (Opcode == ISD::SREM);
SDLoc DL(N);		SDLoc DL(N);
Show All 32 Lines	if (N1.getOpcode() == ISD::SHL &&
DAG.isKnownToBeAPowerOfTwo(N1.getOperand(0))) {		DAG.isKnownToBeAPowerOfTwo(N1.getOperand(0))) {
// fold (urem x, (shl pow2, y)) -> (and x, (add (shl pow2, y), -1))		// fold (urem x, (shl pow2, y)) -> (and x, (add (shl pow2, y), -1))
SDValue Add = DAG.getNode(ISD::ADD, DL, VT, N1, NegOne);		SDValue Add = DAG.getNode(ISD::ADD, DL, VT, N1, NegOne);
AddToWorklist(Add.getNode());		AddToWorklist(Add.getNode());
return DAG.getNode(ISD::AND, DL, VT, N0, Add);		return DAG.getNode(ISD::AND, DL, VT, N0, Add);
}		}
}		}

		// sdiv, srem -> sdivrem
		if (SDValue DivRem = useDivRem(N))
		return DivRem.getValue(1);

		return SDValue();
		}

		// handles ISD::SREM and ISD::UREM
		SDValue DAGCombiner::visitREM(SDNode *N) {
		unsigned Opcode = N->getOpcode();
		SDValue N0 = N->getOperand(0);
		SDValue N1 = N->getOperand(1);
		EVT VT = N->getValueType(0);

		bool isSigned = (Opcode == ISD::SREM);
		SDLoc DL(N);

		if (SDValue V = simplifyREM(N))
		return V;

AttributeList Attr = DAG.getMachineFunction().getFunction().getAttributes();		AttributeList Attr = DAG.getMachineFunction().getFunction().getAttributes();
		lebedev.ri: I think this is independent from the rest? Can you split it into a separate review?
		lebedev.ri: You could put this fold into the middle-end, into instsimplify, if we don't do it already. But since it needs `computeKnownBits()`, some measurements would be needed (how many times does it fire on some codebase? possibly, measure compile-time impact)

// If X/C can be simplified by the division-by-constant logic, lower		// If X/C can be simplified by the division-by-constant logic, lower
// X%C to the equivalent of X-X/C*C.		// X%C to the equivalent of X-X/C*C.
// Reuse the SDIVLike/UDIVLike combines - to avoid mangling nodes, the		// Reuse the SDIVLike/UDIVLike combines - to avoid mangling nodes, the
// speculative DIV must not cause a DIVREM conversion. We guard against this		// speculative DIV must not cause a DIVREM conversion. We guard against this
		lebedev.ri: This feels slightly backwards. Can you do this in `DAGCombiner::visitSETCC()`, and thus avoid all this use-dance?
// by skipping the simplification if isIntDivCheap(). When div is not cheap,		// by skipping the simplification if isIntDivCheap(). When div is not cheap,
// combine will not return a DIVREM. Regardless, checking cheapness here		// combine will not return a DIVREM. Regardless, checking cheapness here
// makes sense since the simplification results in fatter code.		// makes sense since the simplification results in fatter code.
if (DAG.isKnownNeverZero(N1) && !TLI.isIntDivCheap(VT, Attr)) {		if (DAG.isKnownNeverZero(N1) && !TLI.isIntDivCheap(VT, Attr)) {
SDValue OptimizedDiv =		SDValue OptimizedDiv =
isSigned ? visitSDIVLike(N0, N1, N) : visitUDIVLike(N0, N1, N);		isSigned ? visitSDIVLike(N0, N1, N) : visitUDIVLike(N0, N1, N);
if (OptimizedDiv.getNode()) {		if (OptimizedDiv.getNode()) {
// If the equivalent Div node also exists, update its users.		// If the equivalent Div node also exists, update its users.
unsigned DivOpcode = isSigned ? ISD::SDIV : ISD::UDIV;		unsigned DivOpcode = isSigned ? ISD::SDIV : ISD::UDIV;
if (SDNode *DivNode = DAG.getNodeIfExists(DivOpcode, N->getVTList(),		if (SDNode *DivNode = DAG.getNodeIfExists(DivOpcode, N->getVTList(),
{ N0, N1 }))		{ N0, N1 }))
CombineTo(DivNode, OptimizedDiv);		CombineTo(DivNode, OptimizedDiv);
SDValue Mul = DAG.getNode(ISD::MUL, DL, VT, OptimizedDiv, N1);		SDValue Mul = DAG.getNode(ISD::MUL, DL, VT, OptimizedDiv, N1);
SDValue Sub = DAG.getNode(ISD::SUB, DL, VT, N0, Mul);		SDValue Sub = DAG.getNode(ISD::SUB, DL, VT, N0, Mul);
AddToWorklist(OptimizedDiv.getNode());		AddToWorklist(OptimizedDiv.getNode());
AddToWorklist(Mul.getNode());		AddToWorklist(Mul.getNode());
return Sub;		return Sub;
}		}
}		}

// sdiv, srem -> sdivrem
if (SDValue DivRem = useDivRem(N))
return DivRem.getValue(1);

return SDValue();		return SDValue();
}		}

SDValue DAGCombiner::visitMULHS(SDNode *N) {		SDValue DAGCombiner::visitMULHS(SDNode *N) {
SDValue N0 = N->getOperand(0);		SDValue N0 = N->getOperand(0);
SDValue N1 = N->getOperand(1);		SDValue N1 = N->getOperand(1);
EVT VT = N->getValueType(0);		EVT VT = N->getValueType(0);
SDLoc DL(N);		SDLoc DL(N);
▲ Show 20 Lines • Show All 3,902 Lines • ▼ Show 20 Lines	if (CondVT.isInteger() &&
if (VT.bitsEq(CondVT))		if (VT.bitsEq(CondVT))
return NotCond;		return NotCond;
return DAG.getZExtOrTrunc(NotCond, DL, VT);		return DAG.getZExtOrTrunc(NotCond, DL, VT);
}		}

return SDValue();		return SDValue();
}		}

		// See "Test for Zero Remainder after Division by a Constant" in
		// "Hacker's Delight" by Henry Warren
		SDValue DAGCombiner::simplifyUREMWithConstant(EVT VT, SDValue REMNode,
		SDValue CompNode,
		ISD::CondCode Cond,
		SelectionDAG &DAG,
		const SDLoc &DL) {
		// fold (seteq/ne (urem N, D), 0) -> (setule/ugt (rotr (mul N, P), K), Q)
		// - D must be constant with D = D0 * 2^K where D0 is odd and D0 != 1
		// - P is the multiplicative inverse of D0 modulo 2^W
		// - Q = floor((2^W - 1) / D0)
		// where W is the width of the common type of N and D
		assert((Cond == ISD::SETEQ || Cond == ISD::SETNE) &&
		"Only applicable for (in)equality comparisons.");

		if (SDValue V = simplifyREM(REMNode.getNode())) {
		if (V.getOpcode() == ISD::UREM)
		REMNode = V;
		else
		return DAG.getSetCC(DL, VT, V, CompNode, Cond);
		}

		EVT REMVT = REMNode->getValueType(0);
		if (!isTypeLegal(REMVT))
		return SDValue();

		// If MUL is unavailable, we cannot proceed in any case
		if (!TLI.isOperationLegalOrCustom(ISD::MUL, REMVT))
		return SDValue();

		// TODO: Add non-uniform constant support
		ConstantSDNode *Divisor = isConstOrConstSplat(REMNode->getOperand(1));
		ConstantSDNode *CompTarget = isConstOrConstSplat(CompNode);
		if (!Divisor || !CompTarget || Divisor->isNullValue() ||
		!CompTarget->isNullValue())
		return SDValue();

		const APInt &D = Divisor->getAPIntValue();

		// Decompose D into D0 * 2^K
		unsigned K = D.countTrailingZeros();
		bool DivisorIsEven = (K != 0);
		APInt D0 = D.lshr(K);

		// P = inv(D0, 2^W)
		// 2^W requires W + 1 bits, so we have to extend and then truncate
		unsigned W = D.getBitWidth();
		APInt P = D0.zext(W + 1)
		.multiplicativeInverse(APInt::getHighBitsSet(W + 1, 1))
		.trunc(W);

		// Q = floor((2^W - 1) / D0)
		APInt Q = APInt::getAllOnesValue(W).udiv(D0);

		SDValue PVal = DAG.getConstant(P, DL, REMVT);
		SDValue QVal = DAG.getConstant(Q, DL, REMVT);
		// (mul N, P)
		SDValue Mul = DAG.getNode(ISD::MUL, DL, REMVT, REMNode->getOperand(0), PVal);
		AddToWorklist(Mul.getNode());

		ISD::CondCode NewCC = (Cond == ISD::SETEQ) ? ISD::SETULE : ISD::SETUGT;

		// Rotate right only if D was even
		if (DivisorIsEven) {
		// We need ROTR to do this
		if (!TLI.isOperationLegalOrCustom(ISD::ROTR, REMVT))
		return SDValue();
		SDValue ShAmt =
		DAG.getConstant(K, DL, TLI.getShiftAmountTy(REMVT, DAG.getDataLayout()));
		SDNodeFlags Flags;
		Flags.setExact(true);
		// UREM: (rotr (mul N, P), K)
		SDValue Rotr = DAG.getNode(ISD::ROTR, DL, REMVT, Mul, ShAmt, Flags);
		AddToWorklist(Rotr.getNode());
		return DAG.getSetCC(DL, VT, Rotr, QVal, NewCC);
		} else {
		return DAG.getSetCC(DL, VT, Mul, QVal, NewCC);
		}
		}

SDValue DAGCombiner::visitSELECT(SDNode *N) {		SDValue DAGCombiner::visitSELECT(SDNode *N) {
SDValue N0 = N->getOperand(0);		SDValue N0 = N->getOperand(0);
SDValue N1 = N->getOperand(1);		SDValue N1 = N->getOperand(1);
SDValue N2 = N->getOperand(2);		SDValue N2 = N->getOperand(2);
EVT VT = N->getValueType(0);		EVT VT = N->getValueType(0);
EVT VT0 = N0.getValueType();		EVT VT0 = N0.getValueType();
SDLoc DL(N);		SDLoc DL(N);

▲ Show 20 Lines • Show All 752 Lines • ▼ Show 20 Lines	SDValue DAGCombiner::visitSELECT_CC(SDNode *N) {
if (SimplifySelectOps(N, N2, N3))		if (SimplifySelectOps(N, N2, N3))
return SDValue(N, 0); // Don't revisit N.		return SDValue(N, 0); // Don't revisit N.

// fold select_cc into other things, such as min/max/abs		// fold select_cc into other things, such as min/max/abs
return SimplifySelectCC(SDLoc(N), N0, N1, N2, N3, CC);		return SimplifySelectCC(SDLoc(N), N0, N1, N2, N3, CC);
}		}

SDValue DAGCombiner::visitSETCC(SDNode *N) {		SDValue DAGCombiner::visitSETCC(SDNode *N) {
		EVT VT = N->getValueType(0);
		SDValue N0 = N->getOperand(0);
		SDValue N1 = N->getOperand(1);
		SDValue N2 = N->getOperand(2);
		SDLoc DL(N);

		ISD::CondCode Cond = cast<CondCodeSDNode>(N2)->get();
		// Fold remainder of division by a constant
		if (N0.getOpcode() == ISD::UREM && N0.hasOneUse() &&
		(Cond == ISD::SETEQ || Cond == ISD::SETNE)) {
		AttributeList Attr = DAG.getMachineFunction().getFunction().getAttributes();

		// When division is cheap or optimizing for minimum size,
		// fall through to DIVREM creation by skipping this fold.
		if (!TLI.isIntDivCheap(VT, Attr) &&
		!Attr.hasFnAttribute(Attribute::MinSize))
		if (SDValue V = simplifyUREMWithConstant(VT, N0, N1, Cond, DAG, DL))
		return V;
		}

// setcc is very commonly used as an argument to brcond. This pattern		// setcc is very commonly used as an argument to brcond. This pattern
// also lend itself to numerous combines and, as a result, it is desired		// also lend itself to numerous combines and, as a result, it is desired
// we keep the argument to a brcond as a setcc as much as possible.		// we keep the argument to a brcond as a setcc as much as possible.
bool PreferSetCC =		bool PreferSetCC =
N->hasOneUse() && N->use_begin()->getOpcode() == ISD::BRCOND;		N->hasOneUse() && N->use_begin()->getOpcode() == ISD::BRCOND;

SDValue Combined = SimplifySetCC(		SDValue Combined = SimplifySetCC(VT, N0, N1, cast<CondCodeSDNode>(N2)->get(),
N->getValueType(0), N->getOperand(0), N->getOperand(1),		SDLoc(N), !PreferSetCC);
cast<CondCodeSDNode>(N->getOperand(2))->get(), SDLoc(N), !PreferSetCC);

if (!Combined)		if (!Combined)
return SDValue();		return SDValue();

// If we prefer to have a setcc, and we don't, we'll try our best to		// If we prefer to have a setcc, and we don't, we'll try our best to
// recreate one using rebuildSetCC.		// recreate one using rebuildSetCC.
if (PreferSetCC && Combined.getOpcode() != ISD::SETCC) {		if (PreferSetCC && Combined.getOpcode() != ISD::SETCC) {
SDValue NewSetCC = rebuildSetCC(Combined);		SDValue NewSetCC = rebuildSetCC(Combined);
▲ Show 20 Lines • Show All 11,154 Lines • ▼ Show 20 Lines
}		}

/// Given an ISD::UDIV node expressing a divide by constant, return a DAG		/// Given an ISD::UDIV node expressing a divide by constant, return a DAG
/// expression that will generate the same value by multiplying by a magic		/// expression that will generate the same value by multiplying by a magic
/// number.		/// number.
/// Ref: "Hacker's Delight" or "The PowerPC Compiler Writer's Guide".		/// Ref: "Hacker's Delight" or "The PowerPC Compiler Writer's Guide".
SDValue DAGCombiner::BuildUDIV(SDNode *N) {		SDValue DAGCombiner::BuildUDIV(SDNode *N) {
// when optimising for minimum size, we don't want to expand a div to a mul		// when optimising for minimum size, we don't want to expand a div to a mul
// and a shift.		// and a shift.
if (DAG.getMachineFunction().getFunction().hasMinSize())		if (DAG.getMachineFunction().getFunction().hasMinSize())
return SDValue();		return SDValue();
		lebedev.ri: Please format your changes with clang-format (you can automate that via git pre-commit hook)
		lebedev.ri: Would be good to have a test (in a new file) for that.

SmallVector<SDNode *, 8> Built;		SmallVector<SDNode *, 8> Built;
if (SDValue S = TLI.BuildUDIV(N, DAG, LegalOperations, Built)) {		if (SDValue S = TLI.BuildUDIV(N, DAG, LegalOperations, Built)) {
for (SDNode *N : Built)		for (SDNode *N : Built)
AddToWorklist(N);		AddToWorklist(N);
return S;		return S;
}		}
		lebedev.ri: `SmallVector<SDNode *, 8> Built;` like in all the other functions.

return SDValue();		return SDValue();
}		}
		lebedev.ri: Early return please: `SDValue NewCond = TLI.BuildREMEqFold(REMNode, CondNode, DAG, Built); if (!NewCond) return SDValue();`

/// Determines the LogBase2 value for a non-null input value using the		/// Determines the LogBase2 value for a non-null input value using the
/// transform: LogBase2(V) = (EltBits - 1) - ctlz(V).		/// transform: LogBase2(V) = (EltBits - 1) - ctlz(V).
SDValue DAGCombiner::BuildLogBase2(SDValue V, const SDLoc &DL) {		SDValue DAGCombiner::BuildLogBase2(SDValue V, const SDLoc &DL) {
EVT VT = V.getValueType();		EVT VT = V.getValueType();
unsigned EltBits = VT.getScalarSizeInBits();		unsigned EltBits = VT.getScalarSizeInBits();
		lebedev.ri: Hm. Why is this different from what `DAGCombiner::BuildUDIV()` does? I expected to see just the: `for (SDNode *N : Built) AddToWorklist(N); return NewCond;`
SDValue Ctlz = DAG.getNode(ISD::CTLZ, DL, VT, V);		SDValue Ctlz = DAG.getNode(ISD::CTLZ, DL, VT, V);
SDValue Base = DAG.getConstant(EltBits - 1, DL, VT);		SDValue Base = DAG.getConstant(EltBits - 1, DL, VT);
SDValue LogBase2 = DAG.getNode(ISD::SUB, DL, VT, Base, Ctlz);		SDValue LogBase2 = DAG.getNode(ISD::SUB, DL, VT, Base, Ctlz);
return LogBase2;		return LogBase2;
}		}

/// Newton iteration for a function: F(X) is X_{i+1} = X_i - F(X_i)/F'(X_i)		/// Newton iteration for a function: F(X) is X_{i+1} = X_i - F(X_i)/F'(X_i)
/// For the reciprocal, we need to find the zero of the function:		/// For the reciprocal, we need to find the zero of the function:
▲ Show 20 Lines • Show All 631 Lines • Show Last 20 Lines

test/CodeGen/AArch64/urem-seteq-vec-splat.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -mtriple=aarch64-unknown-linux-gnu < %s | FileCheck %s			; RUN: llc -mtriple=aarch64-unknown-linux-gnu < %s | FileCheck %s

	; Tests BuildUREMEqFold for 4 x i32 splat vectors with odd divisor.			; Tests BuildUREMEqFold for 4 x i32 splat vectors with odd divisor.
	; See urem-seteq.ll for justification behind constants emitted.			; See urem-seteq.ll for justification behind constants emitted.
	define <4 x i32> @test_urem_odd_vec_i32(<4 x i32> %X) nounwind readnone {			define <4 x i32> @test_urem_odd_vec_i32(<4 x i32> %X) nounwind readnone {
	; CHECK-LABEL: test_urem_odd_vec_i32:			; CHECK-LABEL: test_urem_odd_vec_i32:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: mov w8, #52429			; CHECK-NEXT: mov w8, #52429
	; CHECK-NEXT: movk w8, #52428, lsl #16			; CHECK-NEXT: movk w8, #52428, lsl #16
	; CHECK-NEXT: dup v2.4s, w8			; CHECK-NEXT: dup v2.4s, w8
	; CHECK-NEXT: umull2 v3.2d, v0.4s, v2.4s			; CHECK-NEXT: movi v1.16b, #51
	; CHECK-NEXT: umull v2.2d, v0.2s, v2.2s			; CHECK-NEXT: mul v0.4s, v0.4s, v2.4s
	; CHECK-NEXT: uzp2 v2.4s, v2.4s, v3.4s			; CHECK-NEXT: cmhs v0.4s, v1.4s, v0.4s
	; CHECK-NEXT: movi v1.4s, #5
	; CHECK-NEXT: ushr v2.4s, v2.4s, #2
	; CHECK-NEXT: mls v0.4s, v2.4s, v1.4s
	; CHECK-NEXT: cmeq v0.4s, v0.4s, #0
	; CHECK-NEXT: movi v1.4s, #1			; CHECK-NEXT: movi v1.4s, #1
	; CHECK-NEXT: and v0.16b, v0.16b, v1.16b			; CHECK-NEXT: and v0.16b, v0.16b, v1.16b
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%urem = urem <4 x i32> %X, <i32 5, i32 5, i32 5, i32 5>			%urem = urem <4 x i32> %X, <i32 5, i32 5, i32 5, i32 5>
	%cmp = icmp eq <4 x i32> %urem, <i32 0, i32 0, i32 0, i32 0>			%cmp = icmp eq <4 x i32> %urem, <i32 0, i32 0, i32 0, i32 0>
	%ret = zext <4 x i1> %cmp to <4 x i32>			%ret = zext <4 x i1> %cmp to <4 x i32>
	ret <4 x i32> %ret			ret <4 x i32> %ret
	}			}

	; Like test_urem_odd_vec_i32, but with 8 x i16 vectors.			; Like test_urem_odd_vec_i32, but with 8 x i16 vectors.
	define <8 x i16> @test_urem_odd_vec_i16(<8 x i16> %X) nounwind readnone {			define <8 x i16> @test_urem_odd_vec_i16(<8 x i16> %X) nounwind readnone {
	; CHECK-LABEL: test_urem_odd_vec_i16:			; CHECK-LABEL: test_urem_odd_vec_i16:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: mov w8, #52429			; CHECK-NEXT: mov w8, #52429
	; CHECK-NEXT: dup v2.8h, w8			; CHECK-NEXT: dup v2.8h, w8
	; CHECK-NEXT: umull2 v3.4s, v0.8h, v2.8h			; CHECK-NEXT: movi v1.16b, #51
	; CHECK-NEXT: umull v2.4s, v0.4h, v2.4h			; CHECK-NEXT: mul v0.8h, v0.8h, v2.8h
	; CHECK-NEXT: uzp2 v2.8h, v2.8h, v3.8h			; CHECK-NEXT: cmhs v0.8h, v1.8h, v0.8h
	; CHECK-NEXT: movi v1.8h, #5
	; CHECK-NEXT: ushr v2.8h, v2.8h, #2
	; CHECK-NEXT: mls v0.8h, v2.8h, v1.8h
	; CHECK-NEXT: cmeq v0.8h, v0.8h, #0
	; CHECK-NEXT: movi v1.8h, #1			; CHECK-NEXT: movi v1.8h, #1
	; CHECK-NEXT: and v0.16b, v0.16b, v1.16b			; CHECK-NEXT: and v0.16b, v0.16b, v1.16b
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%urem = urem <8 x i16> %X, <i16 5, i16 5, i16 5, i16 5,			%urem = urem <8 x i16> %X, <i16 5, i16 5, i16 5, i16 5,
	i16 5, i16 5, i16 5, i16 5>			i16 5, i16 5, i16 5, i16 5>
	%cmp = icmp eq <8 x i16> %urem, <i16 0, i16 0, i16 0, i16 0,			%cmp = icmp eq <8 x i16> %urem, <i16 0, i16 0, i16 0, i16 0,
	i16 0, i16 0, i16 0, i16 0>			i16 0, i16 0, i16 0, i16 0>
	%ret = zext <8 x i1> %cmp to <8 x i16>			%ret = zext <8 x i1> %cmp to <8 x i16>
	▲ Show 20 Lines • Show All 84 Lines • Show Last 20 Lines

test/CodeGen/AArch64/urem-seteq.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -mtriple=aarch64-unknown-linux-gnu < %s | FileCheck %s			; RUN: llc -mtriple=aarch64-unknown-linux-gnu < %s | FileCheck %s

	; This tests the BuildREMEqFold optimization with UREM, i32, odd divisor, SETEQ.			; This tests the BuildREMEqFold optimization with UREM, i32, odd divisor, SETEQ.
	; The corresponding pseudocode is:			; The corresponding pseudocode is:
	; Q <- [N * multInv(5, 2^32)] <=> [N * 0xCCCCCCCD] <=> [N * (-858993459)]			; Q <- [N * multInv(5, 2^32)] <=> [N * 0xCCCCCCCD] <=> [N * (-858993459)]
	; res <- [Q <= (2^32 - 1) / 5] <=> [Q <= 858993459] <=> [Q < 858993460]			; res <- [Q <= (2^32 - 1) / 5] <=> [Q <= 858993459] <=> [Q < 858993460]
	define i32 @test_urem_odd(i32 %X) nounwind readnone {			define i32 @test_urem_odd(i32 %X) nounwind readnone {
	; CHECK-LABEL: test_urem_odd:			; CHECK-LABEL: test_urem_odd:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: mov w8, #52429			; CHECK-NEXT: mov w8, #52429
	; CHECK-NEXT: movk w8, #52428, lsl #16			; CHECK-NEXT: movk w8, #52428, lsl #16
	; CHECK-NEXT: umull x8, w0, w8			; CHECK-NEXT: mov w9, #13108
	; CHECK-NEXT: lsr x8, x8, #34			; CHECK-NEXT: mul w8, w0, w8
	; CHECK-NEXT: add w8, w8, w8, lsl #2			; CHECK-NEXT: movk w9, #13107, lsl #16
	; CHECK-NEXT: cmp w0, w8			; CHECK-NEXT: cmp w8, w9
	; CHECK-NEXT: cset w0, eq			; CHECK-NEXT: cset w0, lo
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%urem = urem i32 %X, 5			%urem = urem i32 %X, 5
	%cmp = icmp eq i32 %urem, 0			%cmp = icmp eq i32 %urem, 0
	%ret = zext i1 %cmp to i32			%ret = zext i1 %cmp to i32
	ret i32 %ret			ret i32 %ret
	}			}

	; This is like test_urem_odd, except the divisor has bit 30 set.			; This is like test_urem_odd, except the divisor has bit 30 set.
	define i32 @test_urem_odd_bit30(i32 %X) nounwind readnone {			define i32 @test_urem_odd_bit30(i32 %X) nounwind readnone {
	; CHECK-LABEL: test_urem_odd_bit30:			; CHECK-LABEL: test_urem_odd_bit30:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: mov w8, #-11			; CHECK-NEXT: mov w8, #43691
	; CHECK-NEXT: umull x8, w0, w8			; CHECK-NEXT: movk w8, #27306, lsl #16
	; CHECK-NEXT: mov w9, #3			; CHECK-NEXT: mul w8, w0, w8
	; CHECK-NEXT: lsr x8, x8, #62			; CHECK-NEXT: cmp w8, #4 // =4
	; CHECK-NEXT: movk w9, #16384, lsl #16			; CHECK-NEXT: cset w0, lo
	; CHECK-NEXT: msub w8, w8, w9, w0
	; CHECK-NEXT: cmp w8, #0 // =0
	; CHECK-NEXT: cset w0, eq
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%urem = urem i32 %X, 1073741827			%urem = urem i32 %X, 1073741827
	%cmp = icmp eq i32 %urem, 0			%cmp = icmp eq i32 %urem, 0
	%ret = zext i1 %cmp to i32			%ret = zext i1 %cmp to i32
	ret i32 %ret			ret i32 %ret
	}			}

	; This is like test_urem_odd, except the divisor has bit 31 set.			; This is like test_urem_odd, except the divisor has bit 31 set.
	define i32 @test_urem_odd_bit31(i32 %X) nounwind readnone {			define i32 @test_urem_odd_bit31(i32 %X) nounwind readnone {
	; CHECK-LABEL: test_urem_odd_bit31:			; CHECK-LABEL: test_urem_odd_bit31:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: mov w8, w0			; CHECK-NEXT: mov w8, #43691
	; CHECK-NEXT: lsl x9, x8, #30			; CHECK-NEXT: movk w8, #10922, lsl #16
	; CHECK-NEXT: sub x8, x9, x8			; CHECK-NEXT: mul w8, w0, w8
	; CHECK-NEXT: lsr x8, x8, #61			; CHECK-NEXT: cmp w8, #2 // =2
	; CHECK-NEXT: mov w9, #-2147483645			; CHECK-NEXT: cset w0, lo
	; CHECK-NEXT: msub w8, w8, w9, w0
	; CHECK-NEXT: cmp w8, #0 // =0
	; CHECK-NEXT: cset w0, eq
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%urem = urem i32 %X, 2147483651			%urem = urem i32 %X, 2147483651
	%cmp = icmp eq i32 %urem, 0			%cmp = icmp eq i32 %urem, 0
	%ret = zext i1 %cmp to i32			%ret = zext i1 %cmp to i32
	ret i32 %ret			ret i32 %ret
	}			}

	; This tests the BuildREMEqFold optimization with UREM, i16, even divisor, SETNE.			; This tests the BuildREMEqFold optimization with UREM, i16, even divisor, SETNE.
	; In this case, D <=> 14 <=> 7 * 2^1, so D0 = 7 and K = 1.			; In this case, D <=> 14 <=> 7 * 2^1, so D0 = 7 and K = 1.
	; The corresponding pseudocode is:			; The corresponding pseudocode is:
	; Q <- [N * multInv(D0, 2^16)] <=> [N * multInv(7, 2^16)] <=> [N * 28087]			; Q <- [N * multInv(D0, 2^16)] <=> [N * multInv(7, 2^16)] <=> [N * 28087]
	; Q <- [Q >>rot K] <=> [Q >>rot 1]			; Q <- [Q >>rot K] <=> [Q >>rot 1]
	; res <- ![Q <= (2^16 - 1) / 7] <=> ![Q <= 9362] <=> [Q > 9362]			; res <- ![Q <= (2^16 - 1) / 7] <=> ![Q <= 9362] <=> [Q > 9362]
	define i16 @test_urem_even(i16 %X) nounwind readnone {			define i16 @test_urem_even(i16 %X) nounwind readnone {
	; CHECK-LABEL: test_urem_even:			; CHECK-LABEL: test_urem_even:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: mov w10, #9363			; CHECK-NEXT: mov w9, #28087
	; CHECK-NEXT: ubfx w9, w0, #1, #15
	; CHECK-NEXT: movk w10, #37449, lsl #16
	; CHECK-NEXT: umull x9, w9, w10
	; CHECK-NEXT: and w8, w0, #0xffff			; CHECK-NEXT: and w8, w0, #0xffff
	; CHECK-NEXT: lsr x9, x9, #34			; CHECK-NEXT: movk w9, #46811, lsl #16
	; CHECK-NEXT: mov w10, #14			; CHECK-NEXT: mul w8, w8, w9
	; CHECK-NEXT: msub w8, w9, w10, w8			; CHECK-NEXT: mov w9, #18724
	; CHECK-NEXT: cmp w8, #0 // =0			; CHECK-NEXT: ror w8, w8, #1
	; CHECK-NEXT: cset w0, ne			; CHECK-NEXT: movk w9, #9362, lsl #16
				; CHECK-NEXT: cmp w8, w9
				; CHECK-NEXT: cset w0, hi
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%urem = urem i16 %X, 14			%urem = urem i16 %X, 14
	%cmp = icmp ne i16 %urem, 0			%cmp = icmp ne i16 %urem, 0
	%ret = zext i1 %cmp to i16			%ret = zext i1 %cmp to i16
	ret i16 %ret			ret i16 %ret
	}			}

	; This is like test_urem_even, except the divisor has bit 30 set.			; This is like test_urem_even, except the divisor has bit 30 set.
	define i32 @test_urem_even_bit30(i32 %X) nounwind readnone {			define i32 @test_urem_even_bit30(i32 %X) nounwind readnone {
	; CHECK-LABEL: test_urem_even_bit30:			; CHECK-LABEL: test_urem_even_bit30:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: mov w8, #-415			; CHECK-NEXT: mov w8, #20165
	; CHECK-NEXT: umull x8, w0, w8			; CHECK-NEXT: movk w8, #64748, lsl #16
	; CHECK-NEXT: mov w9, #104			; CHECK-NEXT: mul w8, w0, w8
	; CHECK-NEXT: lsr x8, x8, #62			; CHECK-NEXT: ror w8, w8, #3
	; CHECK-NEXT: movk w9, #16384, lsl #16			; CHECK-NEXT: cmp w8, #32 // =32
	; CHECK-NEXT: msub w8, w8, w9, w0			; CHECK-NEXT: cset w0, lo
	; CHECK-NEXT: cmp w8, #0 // =0
	; CHECK-NEXT: cset w0, eq
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%urem = urem i32 %X, 1073741928			%urem = urem i32 %X, 1073741928
	%cmp = icmp eq i32 %urem, 0			%cmp = icmp eq i32 %urem, 0
	%ret = zext i1 %cmp to i32			%ret = zext i1 %cmp to i32
	ret i32 %ret			ret i32 %ret
	}			}

	; This is like test_urem_odd, except the divisor has bit 31 set.			; This is like test_urem_odd, except the divisor has bit 31 set.
	define i32 @test_urem_even_bit31(i32 %X) nounwind readnone {			define i32 @test_urem_even_bit31(i32 %X) nounwind readnone {
	; CHECK-LABEL: test_urem_even_bit31:			; CHECK-LABEL: test_urem_even_bit31:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: mov w8, #65435			; CHECK-NEXT: mov w8, #64251
	; CHECK-NEXT: movk w8, #32767, lsl #16			; CHECK-NEXT: movk w8, #47866, lsl #16
	; CHECK-NEXT: umull x8, w0, w8			; CHECK-NEXT: mul w8, w0, w8
	; CHECK-NEXT: mov w9, #102			; CHECK-NEXT: ror w8, w8, #1
	; CHECK-NEXT: lsr x8, x8, #62			; CHECK-NEXT: cmp w8, #4 // =4
	; CHECK-NEXT: movk w9, #32768, lsl #16			; CHECK-NEXT: cset w0, lo
	; CHECK-NEXT: msub w8, w8, w9, w0
	; CHECK-NEXT: cmp w8, #0 // =0
	; CHECK-NEXT: cset w0, eq
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%urem = urem i32 %X, 2147483750			%urem = urem i32 %X, 2147483750
	%cmp = icmp eq i32 %urem, 0			%cmp = icmp eq i32 %urem, 0
	%ret = zext i1 %cmp to i32			%ret = zext i1 %cmp to i32
	ret i32 %ret			ret i32 %ret
	}			}

	; We should not proceed with this fold if the divisor is 1 or -1			; We should not proceed with this fold if the divisor is 1 or -1
	[...]

test/CodeGen/X86/jump_sign.ll

	[...]
	; CHECK-NEXT: testb %al, %al			; CHECK-NEXT: testb %al, %al
	; CHECK-NEXT: jne .LBB12_5			; CHECK-NEXT: jne .LBB12_5
	; CHECK-NEXT: # %bb.3: # %sw.bb			; CHECK-NEXT: # %bb.3: # %sw.bb
	; CHECK-NEXT: xorl %eax, %eax			; CHECK-NEXT: xorl %eax, %eax
	; CHECK-NEXT: testb %al, %al			; CHECK-NEXT: testb %al, %al
	; CHECK-NEXT: jne .LBB12_8			; CHECK-NEXT: jne .LBB12_8
	; CHECK-NEXT: # %bb.4: # %if.end29			; CHECK-NEXT: # %bb.4: # %if.end29
	; CHECK-NEXT: movzwl (%eax), %eax			; CHECK-NEXT: movzwl (%eax), %eax
				; CHECK-NEXT: imull $-13107, %eax, %eax # imm = 0xCCCD
				; CHECK-NEXT: rorw %ax
	; CHECK-NEXT: movzwl %ax, %eax			; CHECK-NEXT: movzwl %ax, %eax
	; CHECK-NEXT: imull $52429, %eax, %ecx # imm = 0xCCCD			; CHECK-NEXT: cmpl $13108, %eax # imm = 0x3334
	; CHECK-NEXT: shrl $18, %ecx			; CHECK-NEXT: jae .LBB12_5
	; CHECK-NEXT: andl $-2, %ecx
	; CHECK-NEXT: leal (%ecx,%ecx,4), %ecx
	; CHECK-NEXT: cmpw %cx, %ax
	; CHECK-NEXT: jne .LBB12_5
	; CHECK-NEXT: .LBB12_8: # %if.then44			; CHECK-NEXT: .LBB12_8: # %if.then44
				lebedev.ri: What do you mean by breaks? Is this a miscompilation?
				hermord: Oh my bad, to clarify: I was referring to `test1` in this file and not the one that's changed here. This one is fine, but the code produced for `test1` after this patch would be less optimal than before. The output of `BuildUREMEqFold` just doesn't seem to fall into patterns that we're already optimizing well in this particular case (`N` is extended from `i1`, then `AND`'ed and then passed into a UREM), and we'd probably have to do a new `KnownBits`-based optimization elsewhere to avoid this. Via `InstCombine` this can actually be fixed with close to no extra overhead because we are already computing the necessary bits anyway.
	; CHECK-NEXT: xorl %eax, %eax			; CHECK-NEXT: xorl %eax, %eax
	; CHECK-NEXT: testb %al, %al			; CHECK-NEXT: testb %al, %al
	; CHECK-NEXT: je .LBB12_9			; CHECK-NEXT: je .LBB12_9
	; CHECK-NEXT: # %bb.10: # %if.else.i104			; CHECK-NEXT: # %bb.10: # %if.else.i104
	; CHECK-NEXT: retl			; CHECK-NEXT: retl
	; CHECK-NEXT: .LBB12_5: # %sw.default			; CHECK-NEXT: .LBB12_5: # %sw.default
	; CHECK-NEXT: xorl %eax, %eax			; CHECK-NEXT: xorl %eax, %eax
	; CHECK-NEXT: testb %al, %al			; CHECK-NEXT: testb %al, %al
	[...]

	; PR13966			; PR13966
	@b = common global i32 0, align 4			@b = common global i32 0, align 4
	@a = common global i32 0, align 4			@a = common global i32 0, align 4
	define i32 @func_test1(i32 %p1) nounwind uwtable {			define i32 @func_test1(i32 %p1) nounwind uwtable {
	; CHECK-LABEL: func_test1:			; CHECK-LABEL: func_test1:
	; CHECK: # %bb.0: # %entry			; CHECK: # %bb.0: # %entry
	; CHECK-NEXT: movl b, %eax			; CHECK-NEXT: movl b, %eax
				; CHECK-NEXT: xorl %ecx, %ecx
	; CHECK-NEXT: cmpl {{[0-9]+}}(%esp), %eax			; CHECK-NEXT: cmpl {{[0-9]+}}(%esp), %eax
	; CHECK-NEXT: setb %cl			; CHECK-NEXT: setb %cl
	; CHECK-NEXT: movl a, %eax			; CHECK-NEXT: movl a, %eax
	; CHECK-NEXT: testb %al, %cl			; CHECK-NEXT: andl %eax, %ecx
	; CHECK-NEXT: je .LBB18_2			; CHECK-NEXT: imull $-85, %ecx, %ecx
				; CHECK-NEXT: cmpb $86, %cl
				; CHECK-NEXT: jb .LBB18_2
	; CHECK-NEXT: # %bb.1: # %if.then			; CHECK-NEXT: # %bb.1: # %if.then
	; CHECK-NEXT: decl %eax			; CHECK-NEXT: decl %eax
	; CHECK-NEXT: movl %eax, a			; CHECK-NEXT: movl %eax, a
				hermord: This is the `test1` regression I mentioned in previous updates. I've made it explicit now, for lack of a clean fix that I can see (without `computeKnownBits`).
				RKSimon: Have you made any progress working out what the problem is with test1?
	; CHECK-NEXT: .LBB18_2: # %if.end			; CHECK-NEXT: .LBB18_2: # %if.end
	; CHECK-NEXT: retl			; CHECK-NEXT: retl
	entry:			entry:
	%t0 = load i32, i32* @b, align 4			%t0 = load i32, i32* @b, align 4
	%cmp = icmp ult i32 %t0, %p1			%cmp = icmp ult i32 %t0, %p1
	%conv = zext i1 %cmp to i32			%conv = zext i1 %cmp to i32
	%t1 = load i32, i32* @a, align 4			%t1 = load i32, i32* @a, align 4
	%and = and i32 %conv, %t1			%and = and i32 %conv, %t1
	[...]

test/CodeGen/X86/urem-seteq-vec-splat.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+sse2 < %s \| FileCheck %s --check-prefixes=CHECK,CHECK-SSE,CHECK-SSE2			; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+sse2 < %s \| FileCheck %s --check-prefixes=CHECK,CHECK-SSE,CHECK-SSE2
				RKSimon: Run with avx2 as well for better test coverage.
	; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+sse4.1 < %s \| FileCheck %s --check-prefixes=CHECK,CHECK-SSE,CHECK-SSE41			; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+sse4.1 < %s \| FileCheck %s --check-prefixes=CHECK,CHECK-SSE,CHECK-SSE41
	; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+avx < %s \| FileCheck %s --check-prefixes=CHECK,CHECK-AVX,CHECK-AVX1			; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+avx < %s \| FileCheck %s --check-prefixes=CHECK,CHECK-AVX,CHECK-AVX1
	; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+avx2 < %s \| FileCheck %s --check-prefixes=CHECK,CHECK-AVX,CHECK-AVX2			; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+avx2 < %s \| FileCheck %s --check-prefixes=CHECK,CHECK-AVX,CHECK-AVX2
	; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+avx512f,+avx512vl < %s \| FileCheck %s --check-prefixes=CHECK,CHECK-AVX,CHECK-AVX512VL			; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+avx512f,+avx512vl < %s \| FileCheck %s --check-prefixes=CHECK,CHECK-AVX,CHECK-AVX512VL
	RKSimon: Please add these test cases back.

	; Tests BuildUREMEqFold for 4 x i32 splat vectors with odd divisor.			; Tests BuildUREMEqFold for 4 x i32 splat vectors with odd divisor.
	; See urem-seteq.ll for justification behind constants emitted.			; See urem-seteq.ll for justification behind constants emitted.
	define <4 x i32> @test_urem_odd_vec_i32(<4 x i32> %X) nounwind readnone {			define <4 x i32> @test_urem_odd_vec_i32(<4 x i32> %X) nounwind readnone {
	; CHECK-SSE2-LABEL: test_urem_odd_vec_i32:			; CHECK-SSE2-LABEL: test_urem_odd_vec_i32:
	; CHECK-SSE2: # %bb.0:			; CHECK-SSE2: # %bb.0:
	; CHECK-SSE2-NEXT: movdqa {{.*#+}} xmm1 = [3435973837,3435973837,3435973837,3435973837]			; CHECK-SSE2-NEXT: movdqa {{.*#+}} xmm1 = [3435973837,3435973837,3435973837,3435973837]
	; CHECK-SSE2-NEXT: movdqa %xmm0, %xmm2			; CHECK-SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm0[1,1,3,3]
				; CHECK-SSE2-NEXT: pmuludq %xmm1, %xmm0
				; CHECK-SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
	; CHECK-SSE2-NEXT: pmuludq %xmm1, %xmm2			; CHECK-SSE2-NEXT: pmuludq %xmm1, %xmm2
	; CHECK-SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm2[1,3,2,3]			; CHECK-SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm2[0,2,2,3]
	; CHECK-SSE2-NEXT: pshufd {{.*#+}} xmm3 = xmm0[1,1,3,3]			; CHECK-SSE2-NEXT: punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
	; CHECK-SSE2-NEXT: pmuludq %xmm1, %xmm3			; CHECK-SSE2-NEXT: pxor {{.*}}(%rip), %xmm0
	; CHECK-SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm3[1,3,2,3]			; CHECK-SSE2-NEXT: pcmpgtd {{.*}}(%rip), %xmm0
	; CHECK-SSE2-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1]			; CHECK-SSE2-NEXT: pandn {{.*}}(%rip), %xmm0
	; CHECK-SSE2-NEXT: psrld $2, %xmm2
	; CHECK-SSE2-NEXT: movdqa %xmm2, %xmm1
	; CHECK-SSE2-NEXT: pslld $2, %xmm1
	; CHECK-SSE2-NEXT: paddd %xmm2, %xmm1
	; CHECK-SSE2-NEXT: psubd %xmm1, %xmm0
	; CHECK-SSE2-NEXT: pxor %xmm1, %xmm1
	; CHECK-SSE2-NEXT: pcmpeqd %xmm1, %xmm0
	; CHECK-SSE2-NEXT: psrld $31, %xmm0
	; CHECK-SSE2-NEXT: retq			; CHECK-SSE2-NEXT: retq
	;			;
	; CHECK-SSE41-LABEL: test_urem_odd_vec_i32:			; CHECK-SSE41-LABEL: test_urem_odd_vec_i32:
	; CHECK-SSE41: # %bb.0:			; CHECK-SSE41: # %bb.0:
	; CHECK-SSE41-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,1,3,3]			; CHECK-SSE41-NEXT: pmulld {{.*}}(%rip), %xmm0
				RKSimon: Use legal types - 8 x i16 etc.
	; CHECK-SSE41-NEXT: movdqa {{.*#+}} xmm2 = [3435973837,3435973837,3435973837,3435973837]			; CHECK-SSE41-NEXT: movdqa {{.*#+}} xmm1 = [858993459,858993459,858993459,858993459]
	; CHECK-SSE41-NEXT: pmuludq %xmm2, %xmm1			; CHECK-SSE41-NEXT: pminud %xmm0, %xmm1
	; CHECK-SSE41-NEXT: pmuludq %xmm0, %xmm2
	; CHECK-SSE41-NEXT: pshufd {{.*#+}} xmm2 = xmm2[1,1,3,3]
	; CHECK-SSE41-NEXT: pblendw {{.*#+}} xmm2 = xmm2[0,1],xmm1[2,3],xmm2[4,5],xmm1[6,7]
	; CHECK-SSE41-NEXT: psrld $2, %xmm2
	; CHECK-SSE41-NEXT: pmulld {{.*}}(%rip), %xmm2
	; CHECK-SSE41-NEXT: psubd %xmm2, %xmm0
	; CHECK-SSE41-NEXT: pxor %xmm1, %xmm1
	; CHECK-SSE41-NEXT: pcmpeqd %xmm1, %xmm0			; CHECK-SSE41-NEXT: pcmpeqd %xmm1, %xmm0
	; CHECK-SSE41-NEXT: psrld $31, %xmm0			; CHECK-SSE41-NEXT: psrld $31, %xmm0
	; CHECK-SSE41-NEXT: retq			; CHECK-SSE41-NEXT: retq
	;			;
	; CHECK-AVX1-LABEL: test_urem_odd_vec_i32:			; CHECK-AVX1-LABEL: test_urem_odd_vec_i32:
	; CHECK-AVX1: # %bb.0:			; CHECK-AVX1: # %bb.0:
	; CHECK-AVX1-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,3,3]			; CHECK-AVX1-NEXT: vpmulld {{.*}}(%rip), %xmm0, %xmm0
	; CHECK-AVX1-NEXT: vmovdqa {{.*#+}} xmm2 = [3435973837,3435973837,3435973837,3435973837]			; CHECK-AVX1-NEXT: vpminud {{.*}}(%rip), %xmm0, %xmm1
	; CHECK-AVX1-NEXT: vpmuludq %xmm2, %xmm1, %xmm1
	; CHECK-AVX1-NEXT: vpmuludq %xmm2, %xmm0, %xmm2
	; CHECK-AVX1-NEXT: vpshufd {{.*#+}} xmm2 = xmm2[1,1,3,3]
	; CHECK-AVX1-NEXT: vpblendw {{.*#+}} xmm1 = xmm2[0,1],xmm1[2,3],xmm2[4,5],xmm1[6,7]
	; CHECK-AVX1-NEXT: vpsrld $2, %xmm1, %xmm1
	; CHECK-AVX1-NEXT: vpmulld {{.*}}(%rip), %xmm1, %xmm1
	; CHECK-AVX1-NEXT: vpsubd %xmm1, %xmm0, %xmm0
	; CHECK-AVX1-NEXT: vpxor %xmm1, %xmm1, %xmm1
	; CHECK-AVX1-NEXT: vpcmpeqd %xmm1, %xmm0, %xmm0			; CHECK-AVX1-NEXT: vpcmpeqd %xmm1, %xmm0, %xmm0
	; CHECK-AVX1-NEXT: vpsrld $31, %xmm0, %xmm0			; CHECK-AVX1-NEXT: vpsrld $31, %xmm0, %xmm0
	; CHECK-AVX1-NEXT: retq			; CHECK-AVX1-NEXT: retq
	;			;
	; CHECK-AVX2-LABEL: test_urem_odd_vec_i32:			; CHECK-AVX2-LABEL: test_urem_odd_vec_i32:
	; CHECK-AVX2: # %bb.0:			; CHECK-AVX2: # %bb.0:
	; CHECK-AVX2-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,3,3]			; CHECK-AVX2-NEXT: vpbroadcastd {{.*#+}} xmm1 = [3435973837,3435973837,3435973837,3435973837]
	; CHECK-AVX2-NEXT: vpbroadcastd {{.*#+}} xmm2 = [3435973837,3435973837,3435973837,3435973837]			; CHECK-AVX2-NEXT: vpmulld %xmm1, %xmm0, %xmm0
	; CHECK-AVX2-NEXT: vpmuludq %xmm2, %xmm1, %xmm1			; CHECK-AVX2-NEXT: vpbroadcastd {{.*#+}} xmm1 = [858993459,858993459,858993459,858993459]
	; CHECK-AVX2-NEXT: vpmuludq %xmm2, %xmm0, %xmm2			; CHECK-AVX2-NEXT: vpminud %xmm1, %xmm0, %xmm1
	; CHECK-AVX2-NEXT: vpshufd {{.*#+}} xmm2 = xmm2[1,1,3,3]
	; CHECK-AVX2-NEXT: vpblendd {{.*#+}} xmm1 = xmm2[0],xmm1[1],xmm2[2],xmm1[3]
	; CHECK-AVX2-NEXT: vpsrld $2, %xmm1, %xmm1
	; CHECK-AVX2-NEXT: vpbroadcastd {{.*#+}} xmm2 = [5,5,5,5]
	; CHECK-AVX2-NEXT: vpmulld %xmm2, %xmm1, %xmm1
	; CHECK-AVX2-NEXT: vpsubd %xmm1, %xmm0, %xmm0
	; CHECK-AVX2-NEXT: vpxor %xmm1, %xmm1, %xmm1
	; CHECK-AVX2-NEXT: vpcmpeqd %xmm1, %xmm0, %xmm0			; CHECK-AVX2-NEXT: vpcmpeqd %xmm1, %xmm0, %xmm0
	; CHECK-AVX2-NEXT: vpsrld $31, %xmm0, %xmm0			; CHECK-AVX2-NEXT: vpsrld $31, %xmm0, %xmm0
	; CHECK-AVX2-NEXT: retq			; CHECK-AVX2-NEXT: retq
	;			;
	; CHECK-AVX512VL-LABEL: test_urem_odd_vec_i32:			; CHECK-AVX512VL-LABEL: test_urem_odd_vec_i32:
	; CHECK-AVX512VL: # %bb.0:			; CHECK-AVX512VL: # %bb.0:
	; CHECK-AVX512VL-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,3,3]			; CHECK-AVX512VL-NEXT: vpmulld {{.*}}(%rip){1to4}, %xmm0, %xmm0
	; CHECK-AVX512VL-NEXT: vpbroadcastd {{.*#+}} xmm2 = [3435973837,3435973837,3435973837,3435973837]			; CHECK-AVX512VL-NEXT: vpminud {{.*}}(%rip){1to4}, %xmm0, %xmm1
	; CHECK-AVX512VL-NEXT: vpmuludq %xmm2, %xmm1, %xmm1
	; CHECK-AVX512VL-NEXT: vpmuludq %xmm2, %xmm0, %xmm2
	; CHECK-AVX512VL-NEXT: vpshufd {{.*#+}} xmm2 = xmm2[1,1,3,3]
	; CHECK-AVX512VL-NEXT: vpblendd {{.*#+}} xmm1 = xmm2[0],xmm1[1],xmm2[2],xmm1[3]
	; CHECK-AVX512VL-NEXT: vpsrld $2, %xmm1, %xmm1
	; CHECK-AVX512VL-NEXT: vpmulld {{.*}}(%rip){1to4}, %xmm1, %xmm1
	; CHECK-AVX512VL-NEXT: vpsubd %xmm1, %xmm0, %xmm0
	; CHECK-AVX512VL-NEXT: vpxor %xmm1, %xmm1, %xmm1
	; CHECK-AVX512VL-NEXT: vpcmpeqd %xmm1, %xmm0, %xmm0			; CHECK-AVX512VL-NEXT: vpcmpeqd %xmm1, %xmm0, %xmm0
	; CHECK-AVX512VL-NEXT: vpsrld $31, %xmm0, %xmm0			; CHECK-AVX512VL-NEXT: vpsrld $31, %xmm0, %xmm0
	; CHECK-AVX512VL-NEXT: retq			; CHECK-AVX512VL-NEXT: retq
	%urem = urem <4 x i32> %X, <i32 5, i32 5, i32 5, i32 5>			%urem = urem <4 x i32> %X, <i32 5, i32 5, i32 5, i32 5>
	%cmp = icmp eq <4 x i32> %urem, <i32 0, i32 0, i32 0, i32 0>			%cmp = icmp eq <4 x i32> %urem, <i32 0, i32 0, i32 0, i32 0>
	%ret = zext <4 x i1> %cmp to <4 x i32>			%ret = zext <4 x i1> %cmp to <4 x i32>
	ret <4 x i32> %ret			ret <4 x i32> %ret
	}			}

	; Like test_urem_odd_vec_i32, but with 8 x i16 vectors.			; Like test_urem_odd_vec_i32, but with 8 x i16 vectors.
				RKSimon: These <4 x i16> -> <8 x i16> test changes need to be done as an NFC commit, showing the current codegen and then this patch rebased. It's up to you whether you keep the aarch64 tests using <4 x i16> or not, but the x86 versions need to be changed to a legal type.
	define <8 x i16> @test_urem_odd_vec_i16(<8 x i16> %X) nounwind readnone {			define <8 x i16> @test_urem_odd_vec_i16(<8 x i16> %X) nounwind readnone {
	; CHECK-SSE-LABEL: test_urem_odd_vec_i16:			; CHECK-SSE2-LABEL: test_urem_odd_vec_i16:
	; CHECK-SSE: # %bb.0:			; CHECK-SSE2: # %bb.0:
	; CHECK-SSE-NEXT: movdqa {{.*#+}} xmm1 = [52429,52429,52429,52429,52429,52429,52429,52429]			; CHECK-SSE2-NEXT: pmullw {{.*}}(%rip), %xmm0
	; CHECK-SSE-NEXT: pmulhuw %xmm0, %xmm1			; CHECK-SSE2-NEXT: psubusw {{.*}}(%rip), %xmm0
	; CHECK-SSE-NEXT: psrlw $2, %xmm1			; CHECK-SSE2-NEXT: pxor %xmm1, %xmm1
	; CHECK-SSE-NEXT: pmullw {{.*}}(%rip), %xmm1			; CHECK-SSE2-NEXT: pcmpeqw %xmm1, %xmm0
	; CHECK-SSE-NEXT: psubw %xmm1, %xmm0			; CHECK-SSE2-NEXT: psrlw $15, %xmm0
	; CHECK-SSE-NEXT: pxor %xmm1, %xmm1			; CHECK-SSE2-NEXT: retq
	; CHECK-SSE-NEXT: pcmpeqw %xmm1, %xmm0			;
	; CHECK-SSE-NEXT: psrlw $15, %xmm0			; CHECK-SSE41-LABEL: test_urem_odd_vec_i16:
	; CHECK-SSE-NEXT: retq			; CHECK-SSE41: # %bb.0:
				; CHECK-SSE41-NEXT: pmullw {{.*}}(%rip), %xmm0
				; CHECK-SSE41-NEXT: movdqa {{.*#+}} xmm1 = [13107,13107,13107,13107,13107,13107,13107,13107]
				; CHECK-SSE41-NEXT: pminuw %xmm0, %xmm1
				; CHECK-SSE41-NEXT: pcmpeqw %xmm1, %xmm0
				; CHECK-SSE41-NEXT: psrlw $15, %xmm0
				; CHECK-SSE41-NEXT: retq
	;			;
	; CHECK-AVX-LABEL: test_urem_odd_vec_i16:			; CHECK-AVX-LABEL: test_urem_odd_vec_i16:
	; CHECK-AVX: # %bb.0:			; CHECK-AVX: # %bb.0:
	; CHECK-AVX-NEXT: vpmulhuw {{.*}}(%rip), %xmm0, %xmm1			; CHECK-AVX-NEXT: vpmullw {{.*}}(%rip), %xmm0, %xmm0
	; CHECK-AVX-NEXT: vpsrlw $2, %xmm1, %xmm1			; CHECK-AVX-NEXT: vpminuw {{.*}}(%rip), %xmm0, %xmm1
	; CHECK-AVX-NEXT: vpmullw {{.*}}(%rip), %xmm1, %xmm1
	; CHECK-AVX-NEXT: vpsubw %xmm1, %xmm0, %xmm0
	; CHECK-AVX-NEXT: vpxor %xmm1, %xmm1, %xmm1
	; CHECK-AVX-NEXT: vpcmpeqw %xmm1, %xmm0, %xmm0			; CHECK-AVX-NEXT: vpcmpeqw %xmm1, %xmm0, %xmm0
	; CHECK-AVX-NEXT: vpsrlw $15, %xmm0, %xmm0			; CHECK-AVX-NEXT: vpsrlw $15, %xmm0, %xmm0
	; CHECK-AVX-NEXT: retq			; CHECK-AVX-NEXT: retq
	%urem = urem <8 x i16> %X, <i16 5, i16 5, i16 5, i16 5,			%urem = urem <8 x i16> %X, <i16 5, i16 5, i16 5, i16 5,
	i16 5, i16 5, i16 5, i16 5>			i16 5, i16 5, i16 5, i16 5>
	%cmp = icmp eq <8 x i16> %urem, <i16 0, i16 0, i16 0, i16 0,			%cmp = icmp eq <8 x i16> %urem, <i16 0, i16 0, i16 0, i16 0,
	i16 0, i16 0, i16 0, i16 0>			i16 0, i16 0, i16 0, i16 0>
	%ret = zext <8 x i1> %cmp to <8 x i16>			%ret = zext <8 x i1> %cmp to <8 x i16>
	[...]
	; CHECK-AVX2-NEXT: vpsubd %xmm1, %xmm0, %xmm0			; CHECK-AVX2-NEXT: vpsubd %xmm1, %xmm0, %xmm0
	; CHECK-AVX2-NEXT: vpxor %xmm1, %xmm1, %xmm1			; CHECK-AVX2-NEXT: vpxor %xmm1, %xmm1, %xmm1
	; CHECK-AVX2-NEXT: vpcmpeqd %xmm1, %xmm0, %xmm0			; CHECK-AVX2-NEXT: vpcmpeqd %xmm1, %xmm0, %xmm0
	; CHECK-AVX2-NEXT: vpsrld $31, %xmm0, %xmm0			; CHECK-AVX2-NEXT: vpsrld $31, %xmm0, %xmm0
	; CHECK-AVX2-NEXT: retq			; CHECK-AVX2-NEXT: retq
	;			;
	; CHECK-AVX512VL-LABEL: test_urem_even_vec_i32:			; CHECK-AVX512VL-LABEL: test_urem_even_vec_i32:
	; CHECK-AVX512VL: # %bb.0:			; CHECK-AVX512VL: # %bb.0:
	; CHECK-AVX512VL-NEXT: vpsrld $1, %xmm0, %xmm1			; CHECK-AVX512VL-NEXT: vpmulld {{.*}}(%rip){1to4}, %xmm0, %xmm0
	; CHECK-AVX512VL-NEXT: vpshufd {{.*#+}} xmm2 = xmm1[1,1,3,3]			; CHECK-AVX512VL-NEXT: vprord $1, %xmm0, %xmm0
	; CHECK-AVX512VL-NEXT: vpbroadcastd {{.*#+}} xmm3 = [2454267027,2454267027,2454267027,2454267027]			; CHECK-AVX512VL-NEXT: vpminud {{.*}}(%rip){1to4}, %xmm0, %xmm1
	; CHECK-AVX512VL-NEXT: vpmuludq %xmm3, %xmm2, %xmm2
	; CHECK-AVX512VL-NEXT: vpmuludq %xmm3, %xmm1, %xmm1
	; CHECK-AVX512VL-NEXT: vpshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
	; CHECK-AVX512VL-NEXT: vpblendd {{.*#+}} xmm1 = xmm1[0],xmm2[1],xmm1[2],xmm2[3]
	; CHECK-AVX512VL-NEXT: vpsrld $2, %xmm1, %xmm1
	; CHECK-AVX512VL-NEXT: vpmulld {{.*}}(%rip){1to4}, %xmm1, %xmm1
	; CHECK-AVX512VL-NEXT: vpsubd %xmm1, %xmm0, %xmm0
	; CHECK-AVX512VL-NEXT: vpxor %xmm1, %xmm1, %xmm1
	; CHECK-AVX512VL-NEXT: vpcmpeqd %xmm1, %xmm0, %xmm0			; CHECK-AVX512VL-NEXT: vpcmpeqd %xmm1, %xmm0, %xmm0
	; CHECK-AVX512VL-NEXT: vpsrld $31, %xmm0, %xmm0			; CHECK-AVX512VL-NEXT: vpsrld $31, %xmm0, %xmm0
	; CHECK-AVX512VL-NEXT: retq			; CHECK-AVX512VL-NEXT: retq
	%urem = urem <4 x i32> %X, <i32 14, i32 14, i32 14, i32 14>			%urem = urem <4 x i32> %X, <i32 14, i32 14, i32 14, i32 14>
	%cmp = icmp eq <4 x i32> %urem, <i32 0, i32 0, i32 0, i32 0>			%cmp = icmp eq <4 x i32> %urem, <i32 0, i32 0, i32 0, i32 0>
	%ret = zext <4 x i1> %cmp to <4 x i32>			%ret = zext <4 x i1> %cmp to <4 x i32>
	ret <4 x i32> %ret			ret <4 x i32> %ret
	}			}
	[...]

test/CodeGen/X86/urem-seteq.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -mtriple=i686-unknown-linux-gnu < %s \| FileCheck %s --check-prefixes=CHECK,X86			; RUN: llc -mtriple=i686-unknown-linux-gnu < %s \| FileCheck %s --check-prefixes=CHECK,X86
	; RUN: llc -mtriple=x86_64-unknown-linux-gnu < %s \| FileCheck %s --check-prefixes=CHECK,X64			; RUN: llc -mtriple=x86_64-unknown-linux-gnu < %s \| FileCheck %s --check-prefixes=CHECK,X64

				lebedev.ri: Actually, looking at the check-lines, I'm not sure we want to check the bmi2 version. I'm not seeing anything there that would benefit from anything above baseline, I think.
				lebedev.ri: I meant to drop this check-line, but apparently forgot. I'm not 100% sure whether we want it or not.
				hermord: Should I drop it? On a related note, I could add `AVX2` to the vector tests on `X86` if that's likely to be useful.
				lebedev.ri: Right, good idea. I checked all the various sse/avx versions, and kept the unique ones that make a difference here.
	; This tests the BuildREMEqFold optimization with UREM, i32, odd divisor, SETEQ.			; This tests the BuildREMEqFold optimization with UREM, i32, odd divisor, SETEQ.
	; The corresponding pseudocode is:			; The corresponding pseudocode is:
	; Q <- [N * multInv(5, 2^32)] <=> [N * 0xCCCCCCCD] <=> [N * (-858993459)]			; Q <- [N * multInv(5, 2^32)] <=> [N * 0xCCCCCCCD] <=> [N * (-858993459)]
	; res <- [Q <= (2^32 - 1) / 5] <=> [Q <= 858993459] <=> [Q < 858993460]			; res <- [Q <= (2^32 - 1) / 5] <=> [Q <= 858993459] <=> [Q < 858993460]
	define i32 @test_urem_odd(i32 %X) nounwind readnone {			define i32 @test_urem_odd(i32 %X) nounwind readnone {
	; X86-LABEL: test_urem_odd:			; X86-LABEL: test_urem_odd:
	; X86: # %bb.0:			; X86: # %bb.0:
	; X86-NEXT: movl {{[0-9]+}}(%esp), %ecx			; X86-NEXT: imull $-858993459, {{[0-9]+}}(%esp), %ecx # imm = 0xCCCCCCCD
	; X86-NEXT: movl $-858993459, %edx # imm = 0xCCCCCCCD
	; X86-NEXT: movl %ecx, %eax
	; X86-NEXT: mull %edx
	; X86-NEXT: shrl $2, %edx
	; X86-NEXT: leal (%edx,%edx,4), %edx
	; X86-NEXT: xorl %eax, %eax			; X86-NEXT: xorl %eax, %eax
	; X86-NEXT: cmpl %edx, %ecx			; X86-NEXT: cmpl $858993460, %ecx # imm = 0x33333334
	; X86-NEXT: sete %al			; X86-NEXT: setb %al
	; X86-NEXT: retl			; X86-NEXT: retl
	;			;
	; X64-LABEL: test_urem_odd:			; X64-LABEL: test_urem_odd:
	; X64: # %bb.0:			; X64: # %bb.0:
	; X64-NEXT: movl %edi, %eax			; X64-NEXT: imull $-858993459, %edi, %ecx # imm = 0xCCCCCCCD
	; X64-NEXT: movl $3435973837, %ecx # imm = 0xCCCCCCCD
	; X64-NEXT: imulq %rax, %rcx
	; X64-NEXT: shrq $34, %rcx
	; X64-NEXT: leal (%rcx,%rcx,4), %ecx
	; X64-NEXT: xorl %eax, %eax			; X64-NEXT: xorl %eax, %eax
	; X64-NEXT: cmpl %ecx, %edi			; X64-NEXT: cmpl $858993460, %ecx # imm = 0x33333334
	; X64-NEXT: sete %al			; X64-NEXT: setb %al
	; X64-NEXT: retq			; X64-NEXT: retq
	%urem = urem i32 %X, 5			%urem = urem i32 %X, 5
	%cmp = icmp eq i32 %urem, 0			%cmp = icmp eq i32 %urem, 0
	%ret = zext i1 %cmp to i32			%ret = zext i1 %cmp to i32
	ret i32 %ret			ret i32 %ret
	}			}
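
The odd-divisor pseudocode above maps directly onto unsigned 32-bit arithmetic, where multiplication wraps modulo 2^32. A minimal sketch, assuming nothing beyond the constants in the comment (`urem5_is_zero` is a hypothetical name, not a function from the patch):

```cpp
#include <cstdint>

// Returns true when x % 5 == 0, without computing the remainder.
static bool urem5_is_zero(uint32_t x) {
  // multInv(5, 2^32) == 0xCCCCCCCD, since 5 * 0xCCCCCCCD == 2^34 + 1.
  uint32_t q = x * 0xCCCCCCCDu;  // wraps modulo 2^32 by definition
  // Multiples of 5 map onto 0 .. (2^32 - 1) / 5; every other input lands above.
  return q <= 858993459u;
}
```

This corresponds to the `imull` / `cmpl $858993460` / `setb` sequence in the new check lines: one multiply and one unsigned compare.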

	; This is like test_urem_odd, except the divisor has bit 30 set.			; This is like test_urem_odd, except the divisor has bit 30 set.
	define i32 @test_urem_odd_bit30(i32 %X) nounwind readnone {			define i32 @test_urem_odd_bit30(i32 %X) nounwind readnone {
	; X86-LABEL: test_urem_odd_bit30:			; X86-LABEL: test_urem_odd_bit30:
	; X86: # %bb.0:			; X86: # %bb.0:
	; X86-NEXT: movl {{[0-9]+}}(%esp), %ecx			; X86-NEXT: imull $1789569707, {{[0-9]+}}(%esp), %ecx # imm = 0x6AAAAAAB
	; X86-NEXT: movl $-11, %edx
	; X86-NEXT: movl %ecx, %eax
	; X86-NEXT: mull %edx
	; X86-NEXT: shrl $30, %edx
	; X86-NEXT: imull $1073741827, %edx, %edx # imm = 0x40000003
	; X86-NEXT: xorl %eax, %eax			; X86-NEXT: xorl %eax, %eax
	; X86-NEXT: cmpl %edx, %ecx			; X86-NEXT: cmpl $4, %ecx
	; X86-NEXT: sete %al			; X86-NEXT: setb %al
	; X86-NEXT: retl			; X86-NEXT: retl
	;			;
	; X64-LABEL: test_urem_odd_bit30:			; X64-LABEL: test_urem_odd_bit30:
	; X64: # %bb.0:			; X64: # %bb.0:
	; X64-NEXT: movl %edi, %eax			; X64-NEXT: imull $1789569707, %edi, %ecx # imm = 0x6AAAAAAB
	; X64-NEXT: movl $4294967285, %ecx # imm = 0xFFFFFFF5
	; X64-NEXT: imulq %rax, %rcx
	; X64-NEXT: shrq $62, %rcx
	; X64-NEXT: imull $1073741827, %ecx, %ecx # imm = 0x40000003
	; X64-NEXT: xorl %eax, %eax			; X64-NEXT: xorl %eax, %eax
	; X64-NEXT: cmpl %ecx, %edi			; X64-NEXT: cmpl $4, %ecx
	; X64-NEXT: sete %al			; X64-NEXT: setb %al
	; X64-NEXT: retq			; X64-NEXT: retq
	%urem = urem i32 %X, 1073741827			%urem = urem i32 %X, 1073741827
	%cmp = icmp eq i32 %urem, 0			%cmp = icmp eq i32 %urem, 0
	%ret = zext i1 %cmp to i32			%ret = zext i1 %cmp to i32
	ret i32 %ret			ret i32 %ret
	}			}
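
The multiplier constants in these tests (0xCCCCCCCD, 0x6AAAAAAB, 0x2AAAAAAB) are multiplicative inverses of the odd divisors modulo 2^32. One common way to compute such an inverse is Newton iteration; the sketch below uses that approach as an assumption and does not claim to mirror how the patch derives the constant. `mult_inv32` and `divisible_by_odd` are hypothetical names.

```cpp
#include <cstdint>

// Hypothetical helper: inverse of an odd d modulo 2^32 via Newton iteration.
static uint32_t mult_inv32(uint32_t d) {
  uint32_t inv = d;  // d * d == 1 (mod 8) for odd d, so 3 bits are correct
  for (int i = 0; i < 4; ++i) {
    inv *= 2u - d * inv;  // each step doubles the correct bits: 6, 12, 24, 48
  }
  return inv;
}

// Returns true when x % d == 0, for odd d only.
static bool divisible_by_odd(uint32_t x, uint32_t d) {
  uint32_t q = x * mult_inv32(d);
  return q <= UINT32_MAX / d;
}
```

For d = 1073741827 the bound `UINT32_MAX / d` is 3, which matches the `cmpl $4` / `setb` pair in the check lines above.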

	; This is like test_urem_odd, except the divisor has bit 31 set.			; This is like test_urem_odd, except the divisor has bit 31 set.
	define i32 @test_urem_odd_bit31(i32 %X) nounwind readnone {			define i32 @test_urem_odd_bit31(i32 %X) nounwind readnone {
	; X86-LABEL: test_urem_odd_bit31:			; X86-LABEL: test_urem_odd_bit31:
	; X86: # %bb.0:			; X86: # %bb.0:
	; X86-NEXT: movl {{[0-9]+}}(%esp), %ecx			; X86-NEXT: imull $715827883, {{[0-9]+}}(%esp), %ecx # imm = 0x2AAAAAAB
	; X86-NEXT: movl $1073741823, %edx # imm = 0x3FFFFFFF
	; X86-NEXT: movl %ecx, %eax
	; X86-NEXT: mull %edx
	; X86-NEXT: shrl $29, %edx
	; X86-NEXT: imull $-2147483645, %edx, %edx # imm = 0x80000003
	; X86-NEXT: xorl %eax, %eax			; X86-NEXT: xorl %eax, %eax
	; X86-NEXT: cmpl %edx, %ecx			; X86-NEXT: cmpl $2, %ecx
	; X86-NEXT: sete %al			; X86-NEXT: setb %al
	; X86-NEXT: retl			; X86-NEXT: retl
	;			;
	; X64-LABEL: test_urem_odd_bit31:			; X64-LABEL: test_urem_odd_bit31:
	; X64: # %bb.0:			; X64: # %bb.0:
	; X64-NEXT: movl %edi, %eax			; X64-NEXT: imull $715827883, %edi, %ecx # imm = 0x2AAAAAAB
	; X64-NEXT: movq %rax, %rcx
	; X64-NEXT: shlq $30, %rcx
	; X64-NEXT: subq %rax, %rcx
	; X64-NEXT: shrq $61, %rcx
	; X64-NEXT: imull $-2147483645, %ecx, %ecx # imm = 0x80000003
	; X64-NEXT: xorl %eax, %eax			; X64-NEXT: xorl %eax, %eax
	; X64-NEXT: cmpl %ecx, %edi			; X64-NEXT: cmpl $2, %ecx
	; X64-NEXT: sete %al			; X64-NEXT: setb %al
	; X64-NEXT: retq			; X64-NEXT: retq
	%urem = urem i32 %X, 2147483651			%urem = urem i32 %X, 2147483651
	%cmp = icmp eq i32 %urem, 0			%cmp = icmp eq i32 %urem, 0
	%ret = zext i1 %cmp to i32			%ret = zext i1 %cmp to i32
	ret i32 %ret			ret i32 %ret
	}			}

	; This tests the BuildREMEqFold optimization with UREM, i16, even divisor, SETNE.			; This tests the BuildREMEqFold optimization with UREM, i16, even divisor, SETNE.
	; In this case, D <=> 14 <=> 7 * 2^1, so D0 = 7 and K = 1.			; In this case, D <=> 14 <=> 7 * 2^1, so D0 = 7 and K = 1.
	; The corresponding pseudocode is:			; The corresponding pseudocode is:
	; Q <- [N * multInv(D0, 2^16)] <=> [N * multInv(7, 2^16)] <=> [N * 28087]			; Q <- [N * multInv(D0, 2^16)] <=> [N * multInv(7, 2^16)] <=> [N * 28087]
	; Q <- [Q >>rot K] <=> [Q >>rot 1]			; Q <- [Q >>rot K] <=> [Q >>rot 1]
	; res <- ![Q <= (2^16 - 1) / 7] <=> ![Q <= 9362] <=> [Q > 9362]			; res <- ![Q <= (2^16 - 1) / 7] <=> ![Q <= 9362] <=> [Q > 9362]
	define i16 @test_urem_even(i16 %X) nounwind readnone {			define i16 @test_urem_even(i16 %X) nounwind readnone {
	; X86-LABEL: test_urem_even:			; X86-LABEL: test_urem_even:
	; X86: # %bb.0:			; X86: # %bb.0:
	; X86-NEXT: movzwl {{[0-9]+}}(%esp), %ecx			; X86-NEXT: imull $28087, {{[0-9]+}}(%esp), %eax # imm = 0x6DB7
	; X86-NEXT: movl %ecx, %eax			; X86-NEXT: rorw %ax
	; X86-NEXT: shrl %eax			; X86-NEXT: movzwl %ax, %ecx
	; X86-NEXT: imull $18725, %eax, %eax # imm = 0x4925
	; X86-NEXT: shrl $17, %eax
	; X86-NEXT: movl %eax, %edx
	; X86-NEXT: shll $4, %edx
	; X86-NEXT: subl %eax, %edx
	; X86-NEXT: subl %eax, %edx
	; X86-NEXT: xorl %eax, %eax			; X86-NEXT: xorl %eax, %eax
	; X86-NEXT: cmpw %dx, %cx			; X86-NEXT: cmpl $9362, %ecx # imm = 0x2492
	; X86-NEXT: setne %al			; X86-NEXT: seta %al
	; X86-NEXT: # kill: def $ax killed $ax killed $eax			; X86-NEXT: # kill: def $ax killed $ax killed $eax
	; X86-NEXT: retl			; X86-NEXT: retl
	;			;
	; X64-LABEL: test_urem_even:			; X64-LABEL: test_urem_even:
	; X64: # %bb.0:			; X64: # %bb.0:
	; X64-NEXT: movzwl %di, %ecx			; X64-NEXT: imull $28087, %edi, %eax # imm = 0x6DB7
	; X64-NEXT: movl %ecx, %eax			; X64-NEXT: rorw %ax
	; X64-NEXT: shrl %eax			; X64-NEXT: movzwl %ax, %ecx
	; X64-NEXT: imull $18725, %eax, %eax # imm = 0x4925
	; X64-NEXT: shrl $17, %eax
	; X64-NEXT: movl %eax, %edx
	; X64-NEXT: shll $4, %edx
	; X64-NEXT: subl %eax, %edx
	; X64-NEXT: subl %eax, %edx
	; X64-NEXT: xorl %eax, %eax			; X64-NEXT: xorl %eax, %eax
	; X64-NEXT: cmpw %dx, %cx			; X64-NEXT: cmpl $9362, %ecx # imm = 0x2492
	; X64-NEXT: setne %al			; X64-NEXT: seta %al
	; X64-NEXT: # kill: def $ax killed $ax killed $eax			; X64-NEXT: # kill: def $ax killed $ax killed $eax
	; X64-NEXT: retq			; X64-NEXT: retq
	%urem = urem i16 %X, 14			%urem = urem i16 %X, 14
	%cmp = icmp ne i16 %urem, 0			%cmp = icmp ne i16 %urem, 0
	%ret = zext i1 %cmp to i16			%ret = zext i1 %cmp to i16
	ret i16 %ret			ret i16 %ret
	}			}

	; This is like test_urem_even, except the divisor has bit 30 set.			; This is like test_urem_even, except the divisor has bit 30 set.
	define i32 @test_urem_even_bit30(i32 %X) nounwind readnone {			define i32 @test_urem_even_bit30(i32 %X) nounwind readnone {
	; X86-LABEL: test_urem_even_bit30:			; X86-LABEL: test_urem_even_bit30:
	; X86: # %bb.0:			; X86: # %bb.0:
	; X86-NEXT: movl {{[0-9]+}}(%esp), %ecx			; X86-NEXT: imull $-51622203, {{[0-9]+}}(%esp), %ecx # imm = 0xFCEC4EC5
	; X86-NEXT: movl $-415, %edx # imm = 0xFE61			; X86-NEXT: rorl $3, %ecx
	; X86-NEXT: movl %ecx, %eax
	; X86-NEXT: mull %edx
	; X86-NEXT: shrl $30, %edx
	; X86-NEXT: imull $1073741928, %edx, %edx # imm = 0x40000068
	; X86-NEXT: xorl %eax, %eax			; X86-NEXT: xorl %eax, %eax
	; X86-NEXT: cmpl %edx, %ecx			; X86-NEXT: cmpl $32, %ecx
	; X86-NEXT: sete %al			; X86-NEXT: setb %al
	; X86-NEXT: retl			; X86-NEXT: retl
	;			;
	; X64-LABEL: test_urem_even_bit30:			; X64-LABEL: test_urem_even_bit30:
	; X64: # %bb.0:			; X64: # %bb.0:
	; X64-NEXT: movl %edi, %eax			; X64-NEXT: imull $-51622203, %edi, %ecx # imm = 0xFCEC4EC5
	; X64-NEXT: movl $4294966881, %ecx # imm = 0xFFFFFE61			; X64-NEXT: rorl $3, %ecx
	; X64-NEXT: imulq %rax, %rcx
	; X64-NEXT: shrq $62, %rcx
	; X64-NEXT: imull $1073741928, %ecx, %ecx # imm = 0x40000068
	; X64-NEXT: xorl %eax, %eax			; X64-NEXT: xorl %eax, %eax
	; X64-NEXT: cmpl %ecx, %edi			; X64-NEXT: cmpl $32, %ecx
	; X64-NEXT: sete %al			; X64-NEXT: setb %al
	; X64-NEXT: retq			; X64-NEXT: retq
	%urem = urem i32 %X, 1073741928			%urem = urem i32 %X, 1073741928
	%cmp = icmp eq i32 %urem, 0			%cmp = icmp eq i32 %urem, 0
	%ret = zext i1 %cmp to i32			%ret = zext i1 %cmp to i32
	ret i32 %ret			ret i32 %ret
	}			}
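
For readers checking the constants in the right-hand column: 1073741928 is 2^3 * 134217741, so the multiplier should be the inverse of 134217741 modulo 2^32 and the rotate amount should be 3. The sketch below derives both, assuming the Hacker's Delight 10-17 formulation. `UremFold`, `makeUremFold`, `rotr32` and `uremIsZero` are hypothetical names, not functions from DAGCombiner.cpp. The bound is computed as floor((2^32 - 1) / d), which is 3 for this divisor. That is the textbook value and is not taken from the CHECK lines.

```cpp
#include <cstdint>

// Constants needed to test x % d == 0 without a division.
struct UremFold {
  uint32_t inverse;  // inverse of the odd part of d, modulo 2^32
  unsigned shift;    // number of trailing zero bits in d
  uint32_t bound;    // floor((2^32 - 1) / d)
};

// Assumption: d != 0 (otherwise the loop below would not terminate).
static UremFold makeUremFold(uint32_t d) {
  unsigned k = 0;
  uint32_t d0 = d;
  while ((d0 & 1u) == 0) { d0 >>= 1; ++k; }  // d = d0 * 2^k with d0 odd
  uint32_t inv = d0;                         // correct to 3 low bits
  for (int i = 0; i < 5; ++i)                // Newton steps, arithmetic wraps mod 2^32
    inv *= 2u - d0 * inv;
  return {inv, k, UINT32_MAX / d};
}

// Rotate a 32-bit value right by k bits.
static uint32_t rotr32(uint32_t v, unsigned k) {
  k &= 31u;
  return k == 0 ? v : (v >> k) | (v << (32 - k));
}

// True when x is a multiple of the divisor that f was built from.
static bool uremIsZero(uint32_t x, const UremFold &f) {
  return rotr32(x * f.inverse, f.shift) <= f.bound;
}
```

`makeUremFold(1073741928u)` yields the multiplier 0xFCEC4EC5 and a shift of 3, matching the `imull` and `rorl` operands shown above.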

	; This is like test_urem_even, except the divisor has bit 31 set.			; This is like test_urem_even, except the divisor has bit 31 set.
	define i32 @test_urem_even_bit31(i32 %X) nounwind readnone {			define i32 @test_urem_even_bit31(i32 %X) nounwind readnone {
	; X86-LABEL: test_urem_even_bit31:			; X86-LABEL: test_urem_even_bit31:
	; X86: # %bb.0:			; X86: # %bb.0:
	; X86-NEXT: movl {{[0-9]+}}(%esp), %ecx			; X86-NEXT: imull $-1157956869, {{[0-9]+}}(%esp), %ecx # imm = 0xBAFAFAFB
	; X86-NEXT: movl $2147483547, %edx # imm = 0x7FFFFF9B			; X86-NEXT: rorl %ecx
	; X86-NEXT: movl %ecx, %eax
	; X86-NEXT: mull %edx
	; X86-NEXT: shrl $30, %edx
	; X86-NEXT: imull $-2147483546, %edx, %edx # imm = 0x80000066
	; X86-NEXT: xorl %eax, %eax			; X86-NEXT: xorl %eax, %eax
	; X86-NEXT: cmpl %edx, %ecx			; X86-NEXT: cmpl $4, %ecx
	; X86-NEXT: sete %al			; X86-NEXT: setb %al
	; X86-NEXT: retl			; X86-NEXT: retl
	;			;
	; X64-LABEL: test_urem_even_bit31:			; X64-LABEL: test_urem_even_bit31:
	; X64: # %bb.0:			; X64: # %bb.0:
	; X64-NEXT: movl %edi, %eax			; X64-NEXT: imull $-1157956869, %edi, %ecx # imm = 0xBAFAFAFB
	; X64-NEXT: imulq $2147483547, %rax, %rax # imm = 0x7FFFFF9B			; X64-NEXT: rorl %ecx
	; X64-NEXT: shrq $62, %rax
	; X64-NEXT: imull $-2147483546, %eax, %ecx # imm = 0x80000066
	; X64-NEXT: xorl %eax, %eax			; X64-NEXT: xorl %eax, %eax
	; X64-NEXT: cmpl %ecx, %edi			; X64-NEXT: cmpl $4, %ecx
	; X64-NEXT: sete %al			; X64-NEXT: setb %al
	; X64-NEXT: retq			; X64-NEXT: retq
	%urem = urem i32 %X, 2147483750			%urem = urem i32 %X, 2147483750
	%cmp = icmp eq i32 %urem, 0			%cmp = icmp eq i32 %urem, 0
	%ret = zext i1 %cmp to i32			%ret = zext i1 %cmp to i32
	ret i32 %ret			ret i32 %ret
	}			}
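
As a cross-check on the right-hand column of the test above, the sketch below emulates the `imull` / `rorl` / `cmpl` / `setb` sequence in plain C++ and compares it with the reference `urem` + `icmp eq` semantics. The function names are made up for illustration, and the emulation reflects a reading of the CHECK lines, not anything in the patch itself. Under that reading the two agree on 0 and on the divisor, but they differ at X = 204. The recorded bound of 4 accepts that input, while the Hacker's Delight bound floor((2^32 - 1) / 2147483750) = 1 rejects it, so the comparison constant may be worth re-deriving.

```cpp
#include <cstdint>

// Reference semantics of the IR: icmp eq (urem X, 2147483750), 0.
static bool referenceIsZero(uint32_t x) { return x % 2147483750u == 0; }

// Multiply by the recorded constant 0xBAFAFAFB and rotate right by one bit.
static uint32_t mulAndRotate(uint32_t x) {
  uint32_t v = x * 0xBAFAFAFBu;  // wraps modulo 2^32, like imull
  return (v >> 1) | (v << 31);   // rorl by 1
}

// The right-hand CHECK lines as read here: cmpl $4 followed by setb,
// i.e. an unsigned "less than 4" test.
static bool recordedSequence(uint32_t x) { return mulAndRotate(x) < 4u; }

// Same sequence with the textbook bound of 1, i.e. unsigned "less than 2".
static bool textbookSequence(uint32_t x) { return mulAndRotate(x) < 2u; }
```

The input 204 is 4 * 1073741875 reduced modulo 2^32, which is why the multiply maps it to 4 and the rotate to 2.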

	; We should not proceed with this fold if the divisor is 1 or -1			; We should not proceed with this fold if the divisor is 1 or -1


Diff 204851
