
[CodeGen] [SelectionDAG] More efficient code for X % C == 0 (UREM case)
AbandonedPublic

Authored by xbolva00 on Aug 2 2018, 8:58 PM.

Details

Summary

This implements an optimization described in Hacker's Delight 10-17: when C is a constant, X % C == 0 can be computed more cheaply without actually calculating the remainder. The motivation is discussed here: https://bugs.llvm.org/show_bug.cgi?id=35479.

Original patch author (Dmytro Shynkevych) Notes:

  • In principle, it's possible to also handle the X % C1 == C2 case, as discussed on bugzilla. This seems to require an extra branch on overflow, so I refrained from implementing this for now.
  • An explicit check for when the REM can be reduced to just its LHS is included: without it, the X % C == 0 optimization breaks test1 in test/CodeGen/X86/jump_sign.ll. I haven't found a better way to avoid generating worse output in this case.
  • I haven't contributed to LLVM before, so I tried to select reviewers based on who I saw in other reviews. In particular, @kparzysz: a Hexagon test is modified and I have no familiarity with the architecture; hopefully my changes are valid.

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden.

Apologies again for the delay.

  • Made cosmetic changes as directed by @lebedev.ri and added vector tests.
  • Disabled the fold for vector divisors with even values (see inline comment and test_urem_even_vec_i16).

Nonsplat divisors, nonzero C2, and possibly target customization (something like bool shouldBuildUREMEqFold(SDValue Divisor, SDValue CompTarget, bool isSREM, const DAGCombinerInfo &DCI)?) will come in separate patches.

The splat tests are only partly the same as the scalar ones, because bit30/bit31 is not handled any differently in vectors and so is redundant; on the other hand, testing the <4 x i16> case turned out to be valuable (it identified the ROTR-induced crash described in an inline comment).

hermord marked 5 inline comments as done. Sep 10 2018, 9:49 PM
hermord added inline comments.
lib/CodeGen/SelectionDAG/TargetLowering.cpp
3738

I've got conflicting directions for this patch (add nonsplat here or submit it separately), so I didn't touch that part in this update.

3755

This turns out to actually be reachable: the logic that handles REM by powers of two is in visitREM, which is invoked after the visitSetCC call from which we would have arrived here. I kept this as an early return, if that's OK.

test/CodeGen/X86/jump_sign.ll
410

This is the test1 regression I mentioned in previous updates. I've made it explicit now, for lack of a clean fix that I can see (without computeKnownBits).

test/CodeGen/X86/urem-seteq.ll
4 ↗(On Diff #163245)

Should I drop it? On a related note, I could add AVX2 to the vector tests on X86 if that's likely to be useful.

RKSimon added inline comments. Sep 11 2018, 8:35 AM
lib/CodeGen/SelectionDAG/TargetLowering.cpp
3748

APInt::lshr returns a value not a reference.

3767

APInt::zext returns a value not a reference.

3772

Again, this returns a value, not a reference.

3779

You must ensure that MUL + ROTR are legal when necessary - pass an IsAfterLegalization flag into the function and check with isOperationLegalOrCustom/isOperationLegal - see TargetLowering::BuildSDIV.

test/CodeGen/X86/urem-seteq-vec-nonsplat.ll
2 ↗(On Diff #164794)

Run with avx2 as well for better test coverage.

6 ↗(On Diff #164794)

You don't need nonsplat in the test name - it's in the filename.

test/CodeGen/X86/urem-seteq-vec-splat.ll
2 ↗(On Diff #164794)

Run with avx2 as well for better test coverage.

27 ↗(On Diff #164794)

Use legal types - 8 x i16 etc.

lebedev.ri accepted this revision. Sep 11 2018, 8:38 AM

Updated the tests in rL341953.
As far as I'm concerned, this is good to go.
Vector and vector non-splat support sounds like a great task for the first follow-up :)
But still, let's wait a bit for others to comment.

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3755

Sounds good.

test/CodeGen/X86/urem-seteq.ll
4 ↗(On Diff #163245)

Right, good idea.
I checked all the various sse/avx versions, and kept the unique ones that make a difference here.

This revision is now accepted and ready to land. Sep 11 2018, 8:38 AM
hermord updated this revision to Diff 164993. Sep 11 2018, 4:07 PM

Thank you for the review, @RKSimon. Made changes as directed.

hermord marked 8 inline comments as done. Sep 11 2018, 4:09 PM
hermord marked an inline comment as done.

Can you please rebase your differential, so the diff is against SVN trunk?
(I have previously committed the vector tests, but they will need to be recommitted now.)

xbolva00 accepted this revision. Sep 12 2018, 12:11 PM

Seems fine for trunk now..

RKSimon requested changes to this revision. Sep 12 2018, 1:41 PM
RKSimon added inline comments.
lib/CodeGen/SelectionDAG/TargetLowering.cpp
3721

Isn't this still missing?

test/CodeGen/X86/urem-seteq-vec-nonsplat.ll
6 ↗(On Diff #165115)

Please add these test cases back.

test/CodeGen/X86/urem-seteq-vec-splat.ll
6 ↗(On Diff #165115)

Please add these test cases back.

This revision now requires changes to proceed. Sep 12 2018, 1:41 PM
hermord updated this revision to Diff 165417. Sep 13 2018, 9:21 PM

Re-added previously committed tests.

hermord marked 2 inline comments as done. Sep 13 2018, 9:25 PM
hermord added inline comments.
lib/CodeGen/SelectionDAG/TargetLowering.cpp
3721

The approach is different now: BuildUREMEqFold is now called from TargetLowering::SimplifySetCC and not from DAGCombiner::visitREM, and I tried to match the style of the other helpers called from there (like simplifySetCCWithAnd).

@RKSimon should I make any other changes to this?

Hm, it seems I stalled the review here, sorry.
See the inline comment.

I think you only updated the splat vector tests to use legal vectors (<8 x i16> instead of <4 x i16>),
but did not update the non-splat test files?
Can you update them too, please? I'll then re-commit the tests once more, hopefully for the last time :)

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3721

Looking at the code on which this was originally based, I think I was slightly wrong.
This should still take a pointer to a SmallVector and push back the newly created SDValues.
And this function should be called from a new helper function that makes proper use of that.

See TargetLowering::BuildSDIVPow2() (which is the implementation) vs. DAGCombiner::BuildSDIVPow2() (which is the wrapper).

hermord updated this revision to Diff 166869. Sep 25 2018, 6:33 AM

@lebedev.ri Sorry, I couldn't find any instances of <4 x i16> in the nonsplat tests; I think I only had those in the splat ones. The patch no longer applied cleanly, so I rebased it again.

hermord added inline comments. Sep 25 2018, 6:38 AM
lib/CodeGen/SelectionDAG/TargetLowering.cpp
3721

Ah sorry, just to clarify, there used to be a DAGCombiner::BuildREMEqFold in an earlier version of this patch, but my rationale for removing it was that:

  1. We wanted to generate this fold when visiting the SETCC and not the UREM.
  2. DAGCombiner::visitSetCC delegates to DAGCombiner::SimplifySetCC, which is just a stub calling TargetLowering::SimplifySetCC.
  3. TargetLowering::SimplifySetCC adds nodes to the worklist directly and doesn't return an SDValue vector.

If it's better to mimic BuildSDIVPow2 and have a wrapper, I could:

  1. Have TargetLowering::SimplifySetCC call the wrapper from DAGCombiner, at the risk of this being slightly confusing (I don't believe anything else in that function does that?)
  2. Remove the call to TargetLowering::BuildUREMEqFold from TargetLowering::SimplifySetCC and add a call to the wrapper instead to DAGCombiner::SimplifySetCC or DAGCombiner::visitSETCC. Currently neither actually contains any case-specific simplification logic, so I'm not sure if I should change that.

@lebedev.ri Sorry, couldn't find any instances of 4 x i16 in the nonsplat tests; I think I only had those in the splat ones.

Hm, could be. I don't see them either now.

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3721

The current BuildUREMEqFold() does not add nodes to the worklist directly, as far as I can see?

Well, I had the third variant in mind.

  1. Rename this function to something like BuildUREMEqFold_REAL() (bad name, just an example).
  2. Add a new wrapper BuildUREMEqFold() here in this file, right after BuildUREMEqFold_REAL().

So nothing changes for the callers.
And the wrapper will do all the work of receiving the vector of newly created SDNodes
and adding them to the worklist.

hermord updated this revision to Diff 167025. Sep 25 2018, 4:39 PM

Made a wrapper as discussed.

hermord added inline comments. Sep 25 2018, 4:52 PM
lib/CodeGen/SelectionDAG/TargetLowering.cpp
3721

It doesn't right now, this is true. I removed it following this comment:

Comment at: lib/CodeGen/SelectionDAG/TargetLowering.cpp:3771
+  SDValue Op1 = DAG.getNode(ISD::MUL, DL, REMVT, REMNode->getOperand(0), PVal);
+  DCI.AddToWorklist(Op1.getNode());
+  // Will change in the signed case
----------------
I do not know if this is needed?
At least rL338079 removed such a call.

Looking back at rL338079, it doesn't say that AddToWorklist is never necessary in this context, and similar code (BuildUDIV) does do it, so removing it might have been premature.

I took a stab at the third variant in the latest diff.

Any further blockers here?

RKSimon added inline comments. Oct 2 2018, 9:24 AM
test/CodeGen/AArch64/urem-seteq-vec-nonsplat.ll
6 ↗(On Diff #167025)

You can commit the nonsplat test name changes as an NFC now to reduce this patch.

test/CodeGen/X86/jump_sign.ll
410

Have you made any progress working out what the problem is with test1?

test/CodeGen/X86/urem-seteq-vec-nonsplat.ll
10 ↗(On Diff #167025)

You can commit the nonsplat test name changes as an NFC now to reduce this patch.

test/CodeGen/X86/urem-seteq-vec-splat.ll
66 ↗(On Diff #167025)

These <4 x i16> -> <8 x i16> test changes need to be done as an NFC commit, showing the current codegen, and then this patch rebased. It's up to you whether you keep the aarch64 tests using <4 x i16> or not, but the x86 versions need to be changed to a legal type.

xbolva00 removed a subscriber: xbolva00.

@hermord Are you still looking at this?

Btw, why don't we do this transformation at the IR level first?

spatel added a subscriber: spatel. Nov 24 2018, 7:43 AM
nikic added a subscriber: nikic. Nov 24 2018, 3:03 PM

@hermord Do you mind if I commandeer this to get it finished please?

@RKSimon I am sorry for the long radio silence. It is entirely my fault.

Regarding test1, the problem seems to be that once this fold is applied, producing something of the form (setule (mul t, -85), 85), it is difficult to deduce that it is the same as (seteq t, 0) in this case. With what was produced before, (seteq (sub t, (mul ...)), 0), a series of optimizations could figure out that the mul node is always zero.

This can be worked around in InstCombine without any additional KnownBits computations beyond what already happens. I'd looked into implementing a fix there before (mentioned in discussion with @lebedev.ri) and could replicate it, but this will require running opt on jump_sign.ll to restore the previous behavior.

I do not have commit rights, so I cannot act on the suggestion to commit the tests as NFC, but I can implement the above (or any other) approach for test1 now, if required, to finalize the patch. I feel this is very close to completion, but if that is not acceptable, please feel free to commandeer the patch.

Ok, so create a new revision with the tests? Somebody will commit it for you.

hermord updated this revision to Diff 180430. Jan 6 2019, 7:23 PM
hermord marked an inline comment as done.

Rebased on top of D56372.

Herald added a project: Restricted Project. Feb 7 2019, 4:26 AM
gnudles requested changes to this revision (edited). Feb 23 2019, 11:03 AM
gnudles added a subscriber: gnudles.

Hi guys, I found the magical formula for unsigned integers that also works with even numbers, without the need to check for overflow, for any remainder:
from a divisor d and remainder r, I calculate 4 constants.
Any d != 0 should work.

void calculate_constants64(uint64_t d, uint64_t r, uint64_t &k,uint64_t &mmi, uint64_t &s,uint64_t& u)
{
	k=__builtin_ctzll(d);/* the power of 2 */
	uint64_t d_odd=d>>k;
	mmi=find_mod_mul_inverse(d_odd,64);
	/* 64 is word size*/
	s=(r*mmi);
	u=(ULLONG_MAX-r)/d;/* note that I divide by d, not d_odd */
}

A little bit of background: the constant (u + 1) is the count of the values in the 64-bit range that yield the correct modulo.
The constant s should zero the first k bits when the given value has remainder r; it also cancels out the odd factor d_odd.

Then the compiler should generate the following code with the given constants:

int checkrem64(uint64_t k,uint64_t mmi, uint64_t s,uint64_t u,uint64_t x)
{
    uint64_t o=((x*mmi)-s);
    o= (o>>k)|(o<<(64-k));/*ROTR64(o,k)*/
    return o<=u;
}

This replaces the following:

/* d is the divisor, r is the remainder */
int checkrem64(uint64_t x)
{
  return x%d==r;
}

This is the code to find the modular inverse:

uint64_t find_mod_mul_inverse(uint64_t x, uint64_t bits)
{
      if (bits > 64 || ((x&1)==0))
              return 0;// invalid parameters
      uint64_t mask;
      if (bits == 64)
              mask = -1;
      else
      {                
              mask = 1;
              mask<<=bits;
              mask--;
      }
      x&=mask;
      uint64_t result=1, state=x, ctz=0;
      while(state!=1ULL)
      {
              ctz=__builtin_ctzll(state^1);
              result|=1ULL<<ctz;
              state+=x<<ctz;
              state&=mask;
      }
      return result;
}

Good luck!
I tested this on all cases of a 10-bit word size, and it passed.

*Edit:* I looked for something that would work for signed integers. I came up with something that would work with negative numbers if the following assumption were correct:

(-21)%10==9

but this assumption is not correct, because (-21)%10 equals -1 in C.
Anyway, the idea is like this: you shift the range and change s and u accordingly:

void calculate_constants64(uint64_t d, uint64_t r, uint64_t &k,uint64_t &mmi, uint64_t &s,uint64_t& u)
{
	k=__builtin_ctzll(d);/* the power of 2 */
	uint64_t d_odd=d>>k;
	mmi=find_mod_mul_inverse(d_odd,64);
	/* 64 is word size*/
     //this is the added line to make it work with signed integers
      r+=0x8000 0000 0000 0000% d;
	s=(r*mmi);
	u=(ULLONG_MAX-r)/d;
}

int checkrem64(uint64_t k,uint64_t mmi, uint64_t s,uint64_t u,uint64_t x)
{
     //this is the added line to make it work with signed integers
//x came as signed number but was casted to unsigned
    x^=0x8000 0000 0000 0000;// this is addition simplified to xor, spaces for clarity.
    uint64_t o=((x*mmi)-s);
    o= (o>>k)|(o<<(64-k));/*ROTR64(o,k)*/
    return o<=u;
}

but there must be a way to tweak u and s to make it work on negative-only or positive-only numbers... it should be easy...
Can someone please contact me to do the math?
I need background and an explanation of how the folding of unsigned numbers should act, and then I will try to find the formula for you... dvoreader@gmail.com
@jdoerfert, please read this carefully, because you did not calculate the Q constant correctly when D is even. That is why it did not work for you when D was 4, and it will also not work when D is 6 or 10, etc.

This revision now requires changes to proceed. Feb 23 2019, 11:03 AM
xbolva00 commandeered this revision. Jun 14 2019, 2:47 PM
xbolva00 updated this revision to Diff 204851.
xbolva00 added a reviewer: hermord.

Rebased. Moved to DAGCombiner and fixed regression in jump_sign.ll.

xbolva00 edited the summary of this revision. (Show Details) Jun 14 2019, 3:00 PM
xbolva00 planned changes to this revision. Jun 14 2019, 3:08 PM

Oh, no. Still there :((

xbolva00 abandoned this revision. Jun 14 2019, 3:27 PM

Hmm, I don't see the button to take over the review,
so I had to post a new one: D63391
The test/CodeGen/X86/jump_sign.ll regression is rather trivial, and is being fixed by a follow-up patch, D63390.