This is an archive of the discontinued LLVM Phabricator instance.

[CodeGen] [SelectionDAG] More efficient code for X % C == 0 (UREM case)
AbandonedPublic

Authored by xbolva00 on Aug 2 2018, 8:58 PM.

Details

Summary

This implements an optimization described in Hacker's Delight 10-17: when C is constant, the result of X % C == 0 can be computed more cheaply without actually calculating the remainder. The motivation is discussed here: https://bugs.llvm.org/show_bug.cgi?id=35479.
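For context, the core arithmetic of the Hacker's Delight transform can be sketched in a few lines of standalone C++ for the odd-divisor case (a simplified sketch, not the in-tree implementation; `mulInverse` and `isRemZero` are illustrative names):

```cpp
#include <cassert>
#include <cstdint>

// Odd-divisor case at 32 bits: with P the multiplicative inverse of D
// modulo 2^32 and Q = floor((2^32 - 1) / D),
//   X % D == 0  <=>  X * P <= Q   (all arithmetic unsigned, mod 2^32).
// Even divisors additionally need a rotate-right, discussed in the review.
static uint32_t mulInverse(uint32_t d) {  // d must be odd
  uint32_t x = d;                         // correct to 3 bits for odd d
  for (int i = 0; i < 5; ++i)
    x *= 2 - d * x;                       // each Newton step doubles the bits
  return x;
}

static bool isRemZero(uint32_t x, uint32_t d) {  // d odd, d > 1
  return x * mulInverse(d) <= UINT32_MAX / d;    // no urem at run time
}
```

For example, `isRemZero(x, 5)` reduces the remainder check to a multiply and a compare, which is the cheaper code this patch aims to emit.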

Original patch author (Dmytro Shynkevych) Notes:

  • In principle, it's possible to also handle the X % C1 == C2 case, as discussed on bugzilla. This seems to require an extra branch on overflow, so I refrained from implementing this for now.
  • An explicit check for when the REM can be reduced to just its LHS is included: the X % C == 0 optimization breaks test1 in test/CodeGen/X86/jump_sign.ll otherwise. I haven't managed to find a better way to avoid generating worse output in this case.
  • I haven't contributed to LLVM before, so I tried to select reviewers based on who I saw in other reviews. In particular, @kparzysz: a Hexagon test is modified and I have no familiarity with the architecture; hopefully my changes are valid.

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden.
hermord added inline comments.Aug 2 2018, 9:07 PM
test/CodeGen/Hexagon/swp-const-tc2.ll
9

To the best of my understanding, it is valid to not perform the multiplication at all, since its result can never be used because of the infinite loop at b3. This is the output I get:

// %bb.0:                               // %b0
	{
		r0 = ##-1431655765
	}
	{
		r0 = #0
	}
	{
		r0 = memw(r0+#0)
	}
	.p2align	4
.LBB0_1:                                // %b3
                                        // =>This Inner Loop Header: Depth=1
	{
		jump .LBB0_1
	}
hermord added inline comments.Aug 2 2018, 9:08 PM
test/CodeGen/Hexagon/swp-const-tc2.ll
9

*its result can never be used

hermord added a reviewer: foad.Aug 2 2018, 9:12 PM
xbolva00 edited reviewers, added: craig.topper; removed: foad.Aug 3 2018, 3:31 AM
xbolva00 added a subscriber: xbolva00.
RKSimon added inline comments.Aug 3 2018, 4:14 AM
include/llvm/CodeGen/TargetLowering.h
3502

Please can you not include the Divisor - we are aiming for full vector support for divisions, not just splats - see D50185

kparzysz added inline comments.Aug 3 2018, 12:34 PM
test/CodeGen/Hexagon/swp-const-tc2.ll
9

Can you make these changes instead?

-define void @f0() {
+define i32 @f0(i32* %a0) {
 b0:
   br label %b1
 
 b1:                                               ; preds = %b1, %b0
   %v0 = phi i32 [ 0, %b0 ], [ %v9, %b1 ]
   %v1 = phi i32 [ 0, %b0 ], [ %v8, %b1 ]
-  %v2 = load i32, i32* undef, align 4
+  %v2 = load i32, i32* %a0, align 4
   %v3 = add nsw i32 %v1, 1
   %v4 = srem i32 %v2, 3
   %v5 = icmp ne i32 %v4, 0
   %v6 = sub nsw i32 0, %v2
   %v7 = select i1 %v5, i32 %v6, i32 %v2
   %v8 = mul nsw i32 %v3, %v7
   %v9 = add nsw i32 %v0, 1
   %v10 = icmp eq i32 %v9, 1
   br i1 %v10, label %b2, label %b1
 
 b2:                                               ; preds = %b1
   %v11 = phi i32 [ %v8, %b1 ]
   br label %b3
 
 b3:                                               ; preds = %b3, %b2
-  br label %b3
+  ret i32 %v11
 }
hermord updated this revision to Diff 159124.Aug 3 2018, 4:20 PM

Made changes as requested.

majnemer added inline comments.
lib/CodeGen/SelectionDAG/TargetLowering.cpp
3635

Is this supposed to be ashr even if we are doing UREM? Seems a little unusual but if this is correct, it probably earns a comment.

hermord updated this revision to Diff 159227.Aug 5 2018, 10:10 AM

@majnemer Thanks; this was, in fact, incorrect. Now, to simplify the logic, the absolute value of D is taken and lshr is used.

There were non-NFC changes to the code, but there were no test changes at all.
I suspect the test coverage is seriously lacking.

majnemer added inline comments.Aug 5 2018, 2:39 PM
lib/CodeGen/SelectionDAG/TargetLowering.cpp
3632–3639

The sign bit may be set for APInts fed into URem operations: all isNegative does is check that the MSB is set. I don't think it is OK to branch on the sign bit here without considering whether we are dealing with URem or SRem. I think what you want is:

if (IsSigned) {
  D0 = D.ashr(K);
} else {
  D0 = D.lshr(K);
}
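A tiny standalone check illustrates the distinction (this mirrors the review point, not the patch itself; the divisor constant is made up for illustration):

```cpp
#include <cassert>
#include <cstdint>

// For URem the divisor is an unsigned quantity; if its MSB happens to be
// set, an arithmetic shift smears that bit and produces the wrong odd
// part D0 of the decomposition D = D0 * 2^K.
static void checkShiftKind() {
  const uint32_t D = 0x80000004u;                  // MSB set, K = 2
  const unsigned K = 2;
  uint32_t viaLshr = D >> K;                       // 0x20000001, the odd D0
  uint32_t viaAshr = (uint32_t)((int32_t)D >> K);  // 0xE0000001, sign-smeared
  assert(viaLshr == 0x20000001u && (viaLshr & 1)); // correct decomposition
  assert(viaAshr == 0xE0000001u);                  // ashr picks a wrong "D0"
  assert(viaAshr != viaLshr);
}
```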
hermord updated this revision to Diff 159505.Aug 7 2018, 8:24 AM

As pointed out by @lebedev.ri, tests were inadequate. Added new tests and stepped through existing ones, adding descriptions. This uncovered two more bugs, which were also fixed here (an extraneous Q.lshrInPlace() and division by D instead of D0). As far as I can see, the coverage now seems reasonable; please point out any cases I missed.

Regarding @majnemer's suggestion, my bad: I should have mentioned in the last update that this algorithm actually assumes that D is positive ("because d is odd and, as we are assuming, positive and not equal to 1..." [Hacker's Delight, p.227]). It probably can be adapted to the D < 0 case with some tweaks, but it seems easier to just take the absolute value of D since this does not change the answer.

A few passing-by thoughts.

include/llvm/CodeGen/TargetLowering.h
3502

This should be SmallVectorImpl<SDNode *> &Created like in all the other cases.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
3372–3392

I think this is independent from the rest?
Can you split it into a separate review?

3394–3397

This feels slightly backwards.
Can you do this in DAGCombiner::visitSETCC(), and thus avoid all this use-dance?

18133–18135

Would be good to have a test (in a new file) for that.

18135

Please format your changes with clang-format (you can automate that via git pre-commit hook)

18142
SmallVector<SDNode *, 8> Built;

like in all the other functions.

18143–18145

Early return please.

SDValue NewCond = TLI.BuildREMEqFold(REMNode, CondNode, DAG, Built);

if (!NewCond)
  return SDValue();
18146–18151

Hm. Why is this different from what DAGCombiner::BuildUDIV() does?
I expected to see just the:

for (SDNode *N : Built)
  AddToWorklist(N);
return NewCond;
lib/CodeGen/SelectionDAG/TargetLowering.cpp
3606

This should be SmallVectorImpl<SDNode *> &Created like in all the other cases.

3631

Separate with newline.

3633–3634

Add self-explanatory assertion?

3645

This is begging to be wrong. Can you create two variables on different lines?

test/CodeGen/Hexagon/swp-const-tc2.ll
36

I think the change to this test can land right away separately from the rest of the changes.

test/CodeGen/X86/rem.ll
81

Why not instead put all this into a new test/CodeGen/X86/rem-seteq.ll file?
I'm not sure whether it is worth it splitting the urem and srem tests into different files though.

Also, really minor: when addressing/fixing an inline comment, could you please actually mark that inline comment as fixed (via that checkbox), else it really isn't clear which comment is in which state..

hermord updated this revision to Diff 161419.Aug 20 2018, 12:06 AM

Thank you for the review, @lebedev.ri. Sorry for the delay; resolved most of the issues. Quick summary of the changes:

  1. BuildREMEqFold is now invoked from SimplifySetCC, as it logically should be. The signature is changed to mirror the other SimplifySetCC helpers.
  2. Tests are moved to a separate file. The format is slightly changed to get rid of irrelevant selects. Size optimization tests are added. The Hexagon test fix is split into a separate patch: D50944.
  3. The |LHS| < |RHS| optimization (call it (*)) is removed from this patch.

Note: since changes were split into two patches, this now fails tests until D50944 gets merged. Even when that happens, this will still fail test1 in CodeGen/X86/jump_sign because of (*). This failure should probably be fixed by implementing (*) in a separate patch, but I'm unsure as to the best way to go about it: since BuildREMEqFold moved out of visitREM, it will get invoked before any optimizations in visitREM have had a chance to run, as the SETCC node is processed first. As I see it, there are the following options:

  1. Move BuildREMEqFold back into visitREM. This is ugly for other reasons, as @lebedev.ri pointed out, but it does solve this particular hurdle. (*) can then be added right before it.
  2. Make (*) a separate function and invoke it in both visitREM and SimplifySetCC (before calling BuildREMEqFold). This may result in it running twice, which is probably bad. Computing KnownBits looks like it's expensive.
  3. Check if (*) is possible in SimplifySetCC and, if so, don't touch the node. visitREM then implements the actual optimization.
  4. Ignore (*) and implement an equivalent optimization for multiplication on the output of BuildREMEqFold.
  5. Accept the probably rare regression (dividend must be known to be smaller than the divisor AND the result has to be fed into a comparison with zero) and edit the test.

I lack experience to make the right call here and would much appreciate any suggestions.

hermord marked 4 inline comments as done.Aug 20 2018, 12:12 AM
hermord added inline comments.
lib/CodeGen/SelectionDAG/TargetLowering.cpp
3633–3634

Sorry, I couldn't figure out what the assertion should be. The negative case can still be handled just fine; do you mean that I should assert (D0 != 1) and then make the caller check for that?

lebedev.ri added inline comments.Aug 20 2018, 12:18 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
3372–3392

You could put this fold into the middle-end, into instsimplify, if we don't do it already.
But since it needs computeKnownBits(), some measurements would be needed
(how many times does it fire on some codebase? possibly, measure compile-time impact)

Some more high-level nits/thoughts:

  • Other than -Os, do we always want to do this, for every target, with no customization hook?
  • Right now the tests look good (other than the run-line question), please commit them now.
  • Similarly, if we want to do it for other arches, also commit a copy of that same rem-seteq.ll test for aarch64 at least.
  • I think this should be further split into two reviews, unsigned case first :/
include/llvm/CodeGen/TargetLowering.h
3696

Separate with newline?

lib/CodeGen/SelectionDAG/TargetLowering.cpp
2767–2769
if (SDValue Folded = BuildREMEqFold(VT, N0, N1, Cond, DCI, dl))
  return Folded;
3608–3610

The produced patterns are rather different between unsigned and signed cases.
I think this should be at least done in two steps (two reviews, unsigned first),
and likely these should be done by two separate functions.

test/CodeGen/X86/rem-seteq.ll
2 ↗(On Diff #161419)

Any reason this is targeting i386 specifically?
I think we should test:

; RUN: llc -mtriple=i686-unknown-linux-gnu                < %s | FileCheck %s --check-prefixes=CHECK,X86
; RUN: llc -mtriple=x86_64-unknown-linux-gnu              < %s | FileCheck %s --check-prefixes=CHECK,X64,NOBMI2
; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+bmi2 < %s | FileCheck %s --check-prefixes=CHECK,X64,BMI2
hermord updated this revision to Diff 163245.Aug 29 2018, 6:32 PM
hermord retitled this revision from [CodeGen] [SelectionDAG] More efficient code for X % C == 0 to [CodeGen] [SelectionDAG] More efficient code for X % C == 0 (UREM case).

Thank you for the review, @lebedev.ri, addressed:

  • Added isIntDivCheap as an additional condition preventing this optimization. If this isn't customizable enough, we could probably do something like if (isIntDivCheap || (minsize && isIntDivShort)) where isIntDivShort can be overridden per-target.
  • I'm not an LLVM developer; as far as I understand, I can't commit anything
  • Added tests for AArch64
  • Removed the signed bits

CodeGen/X86/jump_sign.ll still isn't passing; I've been trying different things, but none seem to make for a satisfying fix (if I introduce the optimization in InstCombine, I'll still have to change the test to pipe through opt, which I'm not sure is better than just changing the test to match current output). On a related note, after looking at it, should this entire thing be in InstCombine, actually?

hermord marked 3 inline comments as done.Aug 29 2018, 6:35 PM

Baseline tests committed, rebase please.

test/CodeGen/X86/urem-seteq.ll
4 ↗(On Diff #163245)

Actually, looking at the check-lines, i'm not sure we want to check the bmi2 version.
I'm not seeing anything there that would benefit from anything above baseline, i think.

lebedev.ri added inline comments.Aug 30 2018, 2:39 AM
test/CodeGen/X86/jump_sign.ll
239–244

What do you mean by breaks? Is this a miscompilation?

test/CodeGen/X86/urem-seteq.ll
4 ↗(On Diff #163245)

I've meant to drop this check-line, but apparently forgot.
I'm not 100% sure if we want/don't want it.

Starting to look much better. Some more nits.

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3608
assert((Cond == ISD::SETEQ || Cond == ISD::SETNE) &&
       "Only applicable for [in]equality comparisons.");
3608
// fold (seteq/ne (urem N, D), 0) -> (setule/ugt (rotr (mul N, P), K), Q)

(it would be great to use more descriptive names to differentiate between Constants and not constants, too)

3630
// Decompose D into D0 * 2^K
3631
bool DivisorIsEven = K != 0;
3634–3636

So what we are really checking here is that D is not a power of two.
How about simplifying this into:

APInt D = Divisor->getAPIntValue();
unsigned W = D.getBitWidth();

// If D is power of two, D0 would be 1, and we cannot build this fold.
if(D.isPowerOf2())
  return SDValue();

// Decompose D into D0 * 2^K
...

BUT. I do not understand where this requirement comes from.
Can you quote the specific part of [[ https://doc.lagout.org/security/Hackers%20Delight.pdf | 10–17 Test for Zero Remainder after Division by a Constant ]] that warrants this?

3638–3642

Would [[ http://llvm.org/doxygen/classllvm_1_1APInt.html#a28ee15b0286415cce5599d4fb9f9ce02 | APInt::multiplicativeInverse() ]] work here?

3645–3646

I would think just writing this as one line would be as clean

APInt Q = APInt::getAllOnesValue(W).udiv(D0);
3652

I do not know if this is needed?
At least rL338079 removed such a call.

3657
if(DivisorIsEven) {
lebedev.ri added inline comments.Aug 30 2018, 6:06 AM
lib/CodeGen/SelectionDAG/TargetLowering.cpp
3608

(it would be great to use more descriptive names to differentiate between Constants and not constants, too)

But thinking about it a bit more, probably don't rename the variables. Consistency [with the orig doc] may be best.

lebedev.ri added inline comments.Aug 30 2018, 6:55 AM
lib/CodeGen/SelectionDAG/TargetLowering.cpp
3627–3636

Well, i see it now - https://rise4fun.com/Alive/aXUu.
Then this should be:

APInt D = Divisor->getAPIntValue();

if(D.isPowerOf2()) {
  // rem by power-of-two is better represented by and-mask.
  return SDValue();
}

// Decompose D into D0 * 2^K
...
3645

Also, move the W here, to the use.

hermord updated this revision to Diff 163461.Aug 30 2018, 8:50 PM

Comments addressed. The minsize condition needs some tweaking, it seems: the code with it works out to actually be longer on X86. Perhaps there should really be something like isIntDivShort.

Re: test1, by it being broken I only mean that it doesn't pass: it's not a miscompilation, just a less optimal output than desired.

hermord marked 10 inline comments as done.Aug 30 2018, 9:25 PM
hermord added inline comments.
lib/CodeGen/SelectionDAG/TargetLowering.cpp
3634–3636

The algorithm also fails in general in this case because the inverse of 1 is 1 itself, and we get, for example:

Given: N: 1, D: 4, C: 0

Then:
K = 2
D0 = 1
P = 1
Q = (2^32 - 1)/1 = 2^32 - 1

And so: (P >>rot 2) == 2^30 < 2^32 - 1 == Q, so the condition holds,
but N % D != 0

The book mentions this in passing: "This can be simplified a little by observing that because d is odd and, as we are assuming, positive and not equal to 1 [...]" (p. 227).
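The failure mode from the worked example can be reproduced in a few lines of C++ (a sketch at an assumed 32-bit width; the function name is illustrative):

```cpp
#include <cassert>
#include <cstdint>

// With D = 4 we get D0 = 1, whose multiplicative inverse P is 1 and whose
// Q saturates at 2^32 - 1, so the "remainder is zero" test can never fail.
static bool foldSaysRemIsZero(uint32_t N) {
  const unsigned K = 2;               // D = D0 * 2^K with D0 = 1
  const uint32_t P = 1;               // multiplicative inverse of D0 == 1
  const uint32_t Q = UINT32_MAX / 1;  // 2^32 - 1
  uint32_t R = N * P;
  R = (R >> K) | (R << (32 - K));     // rotr by K (K > 0 here)
  return R <= Q;                      // trivially true for every N
}
```

In particular, `foldSaysRemIsZero(1)` is true even though 1 % 4 != 0, matching the counterexample above.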

test/CodeGen/X86/jump_sign.ll
239–244

Oh my bad, to clarify: I was referring to test1 in this file and not the one that's changed here. This one is fine, but the code produced for test1 after this patch would be less optimal than before. The output of BuildUREMEqFold just doesn't seem to fall into patterns that we're already optimizing well in this particular case (N is extended from i1, then AND'ed and then passed into an UREM), and we'd probably have to do a new KnownBits-based optimization elsewhere to avoid this. Via InstCombine this can actually be fixed with close to no extra overhead because we are already computing the necessary bits anyway.

hermord updated this revision to Diff 163480.Aug 31 2018, 1:47 AM

Updated AArch64 tests.

This looks about right as far as i can tell.
But would be best if others could review, too.

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3608

The comment change wasn't applied; i'm not aware of a gte predicate, but i do know [u/s]ge.
Also, reading the actual code, ugt was meant it seems.

3613

You can move this closer to where it's actually used.

3630

You don't actually modify D as far as i can tell?
I think it should be const APInt &D

3640

Ok, thank you for the explanation regarding D0 != 1.
I think here now we can just add an assert, to document it.

APInt D0 = D.lshr(K);

assert(!D0.isOneValue() && "The fold is invalid for D0 of 1. Unreachable since we leave powers-of-two to be improved elsewhere.");
3655

Drop the comment about signed cases?
I strongly believe that should be done in a new function.

3665

Drop the comment about signed cases?
I strongly believe that should be done in a new function.

Please add uniform and non-uniform vector test cases as well.

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3623

It should be trivial for you to add non-uniform vector support in this patch, following the patterns I did for the other division-by-constant builders.

And according to GCC mailing list (https://gcc.gnu.org/ml/gcc-patches/2018-09/msg00193.html) and their stats, unsigned case of X % C1 == C2 is still worth to handle as well.

@hermord to reiterate, as far as i'm concerned, this only needs vector tests.
Everything else should go into new reviews (srem, urem nonsplat, srem nonsplat; non-zero constants: another 4 reviews ideally?)
As for the vector tests, just add them here, i will precommit.
I think, you want to operate on <4 x i32>, with the run line

; RUN: llc -mtriple=x86_64-unknown-linux-gnu -mattr=+sse2 < %s | FileCheck %s --check-prefixes=CHECK,CHECK-SSE2

and

; RUN: llc -mtriple=aarch64-unknown-linux-gnu < %s | FileCheck %s

Just completely duplicate the existing test files with -vector suffix. I.e. if you have

define i32 @test_urem_odd(i32 %X) nounwind readnone {
  %urem = urem i32 %X, 5
  %cmp = icmp eq i32 %urem, 0
  %ret = zext i1 %cmp to i32
  ret i32 %ret
}

It will be

define <4 x i32> @test_urem_odd_vec(<4 x i32> %X) nounwind readnone {
  %urem = urem <4 x i32> %X, <i32 5, i32 5, i32 5, i32 5>
  %cmp = icmp eq <4 x i32> %urem, <i32 0, i32 0, i32 0, i32 0>
  %ret = zext <4 x i1> %cmp to <4 x i32>
  ret <4 x i32> %ret
}

I'm not quite sure about nonsplat tests. Things to keep in mind there:

  • You have two constants, so add three positive tests with one constant element being undef (i.e. first test with undef in the first constant, second test with undef in the second constant, and a test with undef in both)
  • The codepath is different depending on whether the divisor is even. So you want to add a test where half of the divisor is even, and half is odd
  • ???
hermord updated this revision to Diff 164794.Sep 10 2018, 9:34 PM

Apologies again for the delay.

  • Made cosmetic changes as directed by @lebedev.ri and added vector tests.
  • Disabled the fold for vector divisors with even values (see inline comment and test_urem_even_vec_i16).

Nonsplat divisors, nonzero C2 and possibly target customization (something like bool shouldBuildUREMEqFold(SDValue Divisor, SDValue CompTarget, bool isSREM, const DAGCombinerInfo &DCI)?) will be in separate patches.

The splat tests are only partly the same as the scalar ones because bit30/bit31 is not handled any differently in vectors and so is redundant; on the other hand, testing the <4 x i16> case turned out to be valuable (identified the ROTR-induced crash described in the inline comment).

hermord marked 5 inline comments as done.Sep 10 2018, 9:49 PM
hermord added inline comments.
lib/CodeGen/SelectionDAG/TargetLowering.cpp
3623

I've got conflicting directions for this patch (add nonsplat here or submit it separately), so I didn't touch that part in this update.

3640

This turns out to actually be reachable: the logic that handles REM by powers of two is in visitREM, which is invoked after visitSetCC from where we would've come here. I kept this as an early return if that's OK.

test/CodeGen/X86/jump_sign.ll
410

This is the test1 regression I mentioned in previous updates. I've made it explicit now, for the lack of a clean fix that I can see (without computeKnownBits).

test/CodeGen/X86/urem-seteq.ll
4 ↗(On Diff #163245)

Should I drop it? On a related note, I could add AVX2 to the vector tests on X86 if that's likely to be useful.

RKSimon added inline comments.Sep 11 2018, 8:35 AM
lib/CodeGen/SelectionDAG/TargetLowering.cpp
3633

APInt::lshr returns a value not a reference.

3652

APInt::zext returns a value not a reference.

3657

Again, value not a reference

3664

You must ensure that MUL + ROTR are legal when necessary - pass an IsAfterLegalization flag into the function and check with isOperationLegalOrCustom/isOperationLegal - see TargetLowering::BuildSDIV

test/CodeGen/X86/urem-seteq-vec-nonsplat.ll
2 ↗(On Diff #164794)

Run with avx2 as well for better test coverage.

6 ↗(On Diff #164794)

You don't need nonsplat in the test name - it's in the filename.

test/CodeGen/X86/urem-seteq-vec-splat.ll
2 ↗(On Diff #164794)

Run with avx2 as well for better test coverage.

27 ↗(On Diff #164794)

Use legal types - 8 x i16 etc.

lebedev.ri accepted this revision.Sep 11 2018, 8:38 AM

Updated the tests in rL341953.
As far as i'm concerned this is good to go.
Vector and vector non-splat support sounds like a great task for the first follow-up :)
But, still let's wait for others to comment for a bit..

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3640

Sounds good.

test/CodeGen/X86/urem-seteq.ll
4 ↗(On Diff #163245)

Right, good idea.
I checked all the various sse/avx versions, and kept the unique ones that make a difference here.

This revision is now accepted and ready to land.Sep 11 2018, 8:38 AM
hermord updated this revision to Diff 164993.Sep 11 2018, 4:07 PM

Thank you for the review, @RKSimon. Made changes as directed.

hermord marked 8 inline comments as done.Sep 11 2018, 4:09 PM
hermord marked an inline comment as done.

Can you please rebase your differential, so the diff is as compared to the svn?
(i have previously committed the vector tests, but they will need to be recommitted now).

xbolva00 accepted this revision.Sep 12 2018, 12:11 PM

Seems fine for trunk now..

RKSimon requested changes to this revision.Sep 12 2018, 1:41 PM
RKSimon added inline comments.
lib/CodeGen/SelectionDAG/TargetLowering.cpp
3606

Isn't this still missing?

test/CodeGen/X86/urem-seteq-vec-nonsplat.ll
6 ↗(On Diff #165115)

Please add these test cases back.

test/CodeGen/X86/urem-seteq-vec-splat.ll
6 ↗(On Diff #165115)

Please add these test cases back.

This revision now requires changes to proceed.Sep 12 2018, 1:41 PM
hermord updated this revision to Diff 165417.Sep 13 2018, 9:21 PM

Re-added previously commited tests.

hermord marked 2 inline comments as done.Sep 13 2018, 9:25 PM
hermord added inline comments.
lib/CodeGen/SelectionDAG/TargetLowering.cpp
3606

The approach is different now: BuildUREMEqFold is now called from TargetLowering::SimplifySetCC and not from DAGCombiner::visitREM, and I tried to match the style of the other helpers called from there (like simplifySetCCWithAnd).

@RKSimon should I make any other changes to this?

Hm, i stalled the review here it seems, sorry.
See inline comment.

I think you only updated the splat vector tests to use legal vectors (<8 x i16> instead of <4 x i16>),
but did not update the non-splat test files?
Can you update them too please? I'll then re-commit the tests once more, hopefully the last time :)

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3606

Looking at the code on which this was originally based, i think i was slightly wrong.
This should still have a pointer to a SmallVector, and it should push back the newly created SDValues.
And this function should be called from a new helper function that would make proper use of that.

See TargetLowering::BuildSDIVPow2() (which is the implementation), vs DAGCombiner::BuildSDIVPow2() (which is the wrapper).

hermord updated this revision to Diff 166869.Sep 25 2018, 6:33 AM

@lebedev.ri Sorry, couldn't find any instances of 4 x i16 in the nonsplat tests; I think I only had those in the splat ones. The patch no longer applied cleanly, so I rebased it again.

hermord added inline comments.Sep 25 2018, 6:38 AM
lib/CodeGen/SelectionDAG/TargetLowering.cpp
3606

Ah sorry, just to clarify, there used to be a DAGCombiner::BuildREMEqFold in an earlier version of this patch, but my rationale for removing it was that:

  1. We wanted to generate this fold when visiting the SETCC and not the UREM.
  2. DAGCombiner::visitSetCC delegates to DAGCombiner::SimplifySetCC, which is just a stub calling TargetLowering::SimplifySetCC.
  3. TargetLowering::SimplifySetCC adds nodes to the worklist directly and doesn't return an SDValue vector.

If it's better to mimic BuildSDIVPow2 and have a wrapper, I could:

  1. Have TargetLowering::SimplifySetCC call the wrapper from DAGCombiner, at the risk of this being slightly confusing (I don't believe anything else in that function does that?)
  2. Remove the call to TargetLowering::BuildUREMEqFold from TargetLowering::SimplifySetCC and add a call to the wrapper instead to DAGCombiner::SimplifySetCC or DAGCombiner::visitSETCC. Currently neither actually contains any case-specific simplification logic, so I'm not sure if I should change that.

@lebedev.ri Sorry, couldn't find any instances of 4 x i16 in the nonsplat tests; I think I only had those in the splat ones.

Hm, could be. I don't see them either now.

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3606

This current BuildUREMEqFold() does not add nodes to the worklist directly, as far as i can see?

Well, i had the third variant in mind.

  1. Rename this function to something like BuildUREMEqFold_REAL() (bad name, just an example)
  2. Add a new wrapper BuildUREMEqFold() here in this file, right after BuildUREMEqFold_REAL().

So nothing changes for the callers.
And the wrapper will do all that stuff about receiving the vector of newly created SDNodes,
and adding them to the worklist.

hermord updated this revision to Diff 167025.Sep 25 2018, 4:39 PM

Made a wrapper as discussed.

hermord added inline comments.Sep 25 2018, 4:52 PM
lib/CodeGen/SelectionDAG/TargetLowering.cpp
3606

It doesn't right now, this is true. I removed it following this comment:

Comment at: lib/CodeGen/SelectionDAG/TargetLowering.cpp:3771
+  SDValue Op1 = DAG.getNode(ISD::MUL, DL, REMVT, REMNode->getOperand(0), PVal);
+  DCI.AddToWorklist(Op1.getNode());
+  // Will change in the signed case
----------------
I do not know if this is needed?
At least rL338079 removed such a call.

Looking back at rL338079, it doesn't say that AddToWorklist is never necessary in this context, and similar code (BuildUDIV) does do it, so removing it might have been premature.

I took a stab at the third variant in the latest diff.

Any further blockers here?

RKSimon added inline comments.Oct 2 2018, 9:24 AM
test/CodeGen/AArch64/urem-seteq-vec-nonsplat.ll
6 ↗(On Diff #167025)

You can commit the nonsplat test name changes as an NFC now to reduce this patch.

test/CodeGen/X86/jump_sign.ll
410

Have you made any progress working out what the problem is with test1?

test/CodeGen/X86/urem-seteq-vec-nonsplat.ll
10 ↗(On Diff #167025)

You can commit the nonsplat test name changes as an NFC now to reduce this patch.

test/CodeGen/X86/urem-seteq-vec-splat.ll
66 ↗(On Diff #167025)

These <4 x i16> -> <8 x i16> test changes need to be done as an NFC commit, showing the current codegen and then this patch rebased. It's up to you if you keep the aarch64 tests using <4 x i16> or not, but the x86 versions need to be changed to a legal type.

xbolva00 removed a subscriber: xbolva00.

@hermord Are you still looking at this?

Btw, why do we not do this transformation at the IR level first?

spatel added a subscriber: spatel.Nov 24 2018, 7:43 AM
nikic added a subscriber: nikic.Nov 24 2018, 3:03 PM

@hermord Do you mind if I commandeer this to get it finished please?

@RKSimon I am sorry for the long radio silence. It is entirely my fault.

Regarding test1, the problem seems to be that once this fold is applied, producing something of the form (setule (mul t, -85), 85), it is difficult to deduce that this is the same as (seteq t, 0) in this case. In what was produced before, (seteq (sub t, (mul ...)), 0), a series of optimizations could figure out that the mul node is always zero.

This can be worked around in InstCombine without any additional KnownBits computations beyond what already happens. I'd looked into implementing a fix there before (mentioned in discussion with @lebedev.ri) and could replicate it, but this will require running opt on jump_sign.ll to restore previous behavior.

I do not have commit rights, so I cannot enact the suggestion to commit the tests as NFC, but I can implement the above (or any other) approach for test1 now, if required, to finalize the patch. I feel like this is very close to completion, but if this is not acceptable, please feel free to commandeer the patch.

Ok, so create a new revision with tests? Somebody will commit it for you.

hermord updated this revision to Diff 180430.Jan 6 2019, 7:23 PM
hermord marked an inline comment as done.

Rebased on top of D56372.

Herald added a project: Restricted Project.Feb 7 2019, 4:26 AM
gnudles requested changes to this revision.EditedFeb 23 2019, 11:03 AM
gnudles added a subscriber: gnudles.

Hi guys, I found the magical formula for unsigned integers that also works with even numbers and any remainder, without the need to check for overflows:
from divisor d and remainder r, I calculate 4 constants.
Any d != 0 should fit.

void calculate_constants64(uint64_t d, uint64_t r, uint64_t &k,uint64_t &mmi, uint64_t &s,uint64_t& u)
{
	k=__builtin_ctzll(d);/* the power of 2 */
	uint64_t d_odd=d>>k;
	mmi=find_mod_mul_inverse(d_odd,64);
	/* 64 is word size*/
	s=(r*mmi);
	u=(ULLONG_MAX-r)/d;/* note that I divide by d, not d_odd */
}

A little bit of background: the constant (u + 1) is the count of the possible values in the 64-bit range that will yield the correct remainder.
The constant s should zero the first k bits if the given value has a remainder of r; it will also zero the remainder modulo d_odd.

then the compiler should generate the following code with the given constants:

int checkrem64(uint64_t k,uint64_t mmi, uint64_t s,uint64_t u,uint64_t x)
{
    uint64_t o=((x*mmi)-s);
    o= (o>>k)|(o<<(64-k));/*ROTR64(o,k)*/
    return o<=u;
}

this replaces the following:

/* d is the divisor, r is the remainder */
int checkrem64(uint64_t x)
{
  return x%d==r;
}

this is the code to find the modular inverse:

uint64_t find_mod_mul_inverse(uint64_t x, uint64_t bits)
{
      if (bits > 64 || ((x&1)==0))
              return 0;// invalid parameters
      uint64_t mask;
      if (bits == 64)
              mask = -1;
      else
      {                
              mask = 1;
              mask<<=bits;
              mask--;
      }
      x&=mask;
      uint64_t result=1, state=x, ctz=0;
      while(state!=1ULL)
      {
              ctz=__builtin_ctzll(state^1);
              result|=1ULL<<ctz;
              state+=x<<ctz;
              state&=mask;
      }
      return result;
}

good luck!
I tested this on all the cases of a 10-bit word size, and it passed.
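The scheme above can also be exercised mechanically at a 64-bit word size (a hedged sketch: `inv64` is a compact stand-in for find_mod_mul_inverse, and the divisor/remainder pairs are made-up examples with an even divisor, so k > 0):

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

static uint64_t inv64(uint64_t x) {  // inverse of odd x mod 2^64
  uint64_t r = x;                    // correct to 3 bits for odd x
  for (int i = 0; i < 5; ++i)
    r *= 2 - x * r;                  // each Newton step doubles the bits
  return r;
}

// Does x % d == r, computed via the four constants (k, mmi, s, u)?
static bool remIs(uint64_t x, uint64_t d, uint64_t r) {
  unsigned k = 0;
  while (!((d >> k) & 1)) ++k;       // k = ctz(d); d assumed even, so k > 0
  uint64_t mmi = inv64(d >> k);      // inverse of the odd part d_odd
  uint64_t s = r * mmi;
  uint64_t u = (ULLONG_MAX - r) / d; // divide by d, not d_odd
  uint64_t o = x * mmi - s;
  o = (o >> k) | (o << (64 - k));    // ROTR64(o, k), valid because k > 0
  return o <= u;
}
```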

*Edit:* I looked for something that will work for signed integers. I came up with something that would work with negative numbers if the following assumption was correct:

(-21)%10==9

but this assumption is not correct because (-21)%10 equals -1.
Anyway, the idea is this: you shift the range and change s and u accordingly:

void calculate_constants64(uint64_t d, uint64_t r, uint64_t &k,uint64_t &mmi, uint64_t &s,uint64_t& u)
{
	k=__builtin_ctzll(d);/* the power of 2 */
	uint64_t d_odd=d>>k;
	mmi=find_mod_mul_inverse(d_odd,64);
	/* 64 is word size*/
     //this is the added line to make it work with signed integers
      r+=0x8000 0000 0000 0000% d;
	s=(r*mmi);
	u=(ULLONG_MAX-r)/d;
}

int checkrem64(uint64_t k,uint64_t mmi, uint64_t s,uint64_t u,uint64_t x)
{
     //this is the added line to make it work with signed integers
//x came as signed number but was casted to unsigned
    x^=0x8000 0000 0000 0000;// this is addition simplified to xor, spaces for clarity.
    uint64_t o=((x*mmi)-s);
    o= (o>>k)|(o<<(64-k));/*ROTR64(o,k)*/
    return o<=u;
}

but there must be a way to tweak u and s to make it work on negative only numbers or positive only numbers.... it should be easy...
Can someone please contact me to do the math?
I need a background and explanation how the folding of unsigned numbers should act and then I will try to find for you the formula... dvoreader@gmail.com
@jdoerfert, please read this carefully, because you did not calculate the Q constant correctly when D is even. That is why it did not work for you when D was 4, and it will also not work when D is 6 or 10, etc.

This revision now requires changes to proceed.Feb 23 2019, 11:03 AM
xbolva00 commandeered this revision.Jun 14 2019, 2:47 PM
xbolva00 updated this revision to Diff 204851.
xbolva00 added a reviewer: hermord.

Rebased. Moved to DAGCombiner and fixed regression in jump_sign.ll.

xbolva00 edited the summary of this revision. (Show Details)Jun 14 2019, 3:00 PM
xbolva00 planned changes to this revision.Jun 14 2019, 3:08 PM

Oh, no. Still there :((

xbolva00 abandoned this revision.Jun 14 2019, 3:27 PM

Hmm, i don't see the button to take the review,
so i had to post a new one: D63391
The test/CodeGen/X86/jump_sign.ll regression is rather trivial, and is being fixed by a followup patch D63390.