
[CodeGen] [SelectionDAG] More efficient code for X % C == 0 (UREM case)
AbandonedPublic

Authored by xbolva00 on Aug 2 2018, 8:58 PM.

Details

Summary

This implements an optimization described in Hacker's Delight 10-17: when C is a constant, X % C == 0 can be computed more cheaply without actually calculating the remainder. The motivation is discussed here: https://bugs.llvm.org/show_bug.cgi?id=35479.

Original patch author (Dmytro Shynkevych) Notes:

  • In principle, it's possible to also handle the X % C1 == C2 case, as discussed on bugzilla. This seems to require an extra branch on overflow, so I refrained from implementing this for now.
  • An explicit check for when the REM can be reduced to just its LHS is included: without it, the X % C == 0 optimization breaks test1 in test/CodeGen/X86/jump_sign.ll. I haven't found a better way to avoid generating worse output in this case.
  • I haven't contributed to LLVM before, so I tried to select reviewers based on who I saw in other reviews. In particular, @kparzysz: a Hexagon test is modified and I have no familiarity with the architecture; hopefully my changes are valid.

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden.

Apologies again for the delay.

  • Made cosmetic changes as directed by @lebedev.ri and added vector tests.
  • Disabled the fold for vector divisors with even values (see inline comment and test_urem_even_vec_i16).

Nonsplat divisors, nonzero C2, and possibly target customization (something like bool shouldBuildUREMEqFold(SDValue Divisor, SDValue CompTarget, bool isSREM, const DAGCombinerInfo &DCI)?) will come in separate patches.

The splat tests are only partly the same as the scalar ones, because bit30/bit31 is not handled any differently in vectors and so is redundant; on the other hand, testing the <4 x i16> case turned out to be valuable (it identified the ROTR-induced crash described in an inline comment).

hermord marked 5 inline comments as done. Sep 10 2018, 9:49 PM
hermord added inline comments.
lib/CodeGen/SelectionDAG/TargetLowering.cpp
3738

I've got conflicting directions for this patch (add nonsplat here or submit it separately), so I didn't touch that part in this update.

3755

This turns out to actually be reachable: the logic that handles REM by powers of two is in visitREM, which is invoked after the visitSetCC call from which we would have arrived here. I kept this as an early return, if that's OK.

test/CodeGen/X86/jump_sign.ll
410

This is the test1 regression I mentioned in previous updates. I've made it explicit now, for lack of a clean fix that I can see (without computeKnownBits).

test/CodeGen/X86/urem-seteq.ll
4 ↗(On Diff #163245)

Should I drop it? On a related note, I could add AVX2 to the vector tests on X86 if that's likely to be useful.

RKSimon added inline comments. Sep 11 2018, 8:35 AM
lib/CodeGen/SelectionDAG/TargetLowering.cpp
3748

APInt::lshr returns a value not a reference.

3767

APInt::zext returns a value not a reference.

3772

Again, this returns a value, not a reference.

3779

You must ensure that MUL + ROTR are legal when necessary - pass an IsAfterLegalization flag into the function and check with isOperationLegalOrCustom/isOperationLegal - see TargetLowering::BuildSDIV.

test/CodeGen/X86/urem-seteq-vec-nonsplat.ll
2 ↗(On Diff #164794)

Run with avx2 as well for better test coverage.

6 ↗(On Diff #164794)

You don't need nonsplat in the test name - it's in the filename.

test/CodeGen/X86/urem-seteq-vec-splat.ll
2 ↗(On Diff #164794)

Run with avx2 as well for better test coverage.

27 ↗(On Diff #164794)

Use legal types - 8 x i16 etc.

lebedev.ri accepted this revision. Sep 11 2018, 8:38 AM

Updated the tests in rL341953.
As far as I'm concerned, this is good to go.
Vector and vector non-splat support sounds like a great task for the first follow-up :)
But still, let's wait a bit for others to comment.

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3755

Sounds good.

test/CodeGen/X86/urem-seteq.ll
4 ↗(On Diff #163245)

Right, good idea.
I checked all the various sse/avx versions, and kept the unique ones that make a difference here.

This revision is now accepted and ready to land. Sep 11 2018, 8:38 AM
hermord updated this revision to Diff 164993. Sep 11 2018, 4:07 PM

Thank you for the review, @RKSimon. Made changes as directed.

hermord marked 8 inline comments as done. Sep 11 2018, 4:09 PM
hermord marked an inline comment as done.

Can you please rebase your differential, so the diff is against SVN trunk?
(I have previously committed the vector tests, but they will need to be recommitted now.)

xbolva00 accepted this revision. Sep 12 2018, 12:11 PM

Seems fine for trunk now..

RKSimon requested changes to this revision. Sep 12 2018, 1:41 PM
RKSimon added inline comments.
lib/CodeGen/SelectionDAG/TargetLowering.cpp
3721

Isn't this still missing?

test/CodeGen/X86/urem-seteq-vec-nonsplat.ll
6 ↗(On Diff #165115)

Please add these test cases back.

test/CodeGen/X86/urem-seteq-vec-splat.ll
6 ↗(On Diff #165115)

Please add these test cases back.

This revision now requires changes to proceed. Sep 12 2018, 1:41 PM
hermord updated this revision to Diff 165417. Sep 13 2018, 9:21 PM

Re-added previously committed tests.

hermord marked 2 inline comments as done. Sep 13 2018, 9:25 PM
hermord added inline comments.
lib/CodeGen/SelectionDAG/TargetLowering.cpp
3721

The approach is different now: BuildUREMEqFold is now called from TargetLowering::SimplifySetCC and not from DAGCombiner::visitREM, and I tried to match the style of the other helpers called from there (like simplifySetCCWithAnd).

@RKSimon should I make any other changes to this?

Hm, it seems I stalled the review here, sorry.
See the inline comment.

I think you only updated the splat vector tests to use legal vectors (<8 x i16> instead of <4 x i16>),
but did not update the non-splat test files?
Can you update them too, please? I'll then re-commit the tests once more, hopefully for the last time :)

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3721

Looking at the code on which this was originally based, I think I was slightly wrong.
This should still take a pointer to a SmallVector and push back the newly created SDValues.
And this function should be called from a new helper function that makes proper use of that.

See TargetLowering::BuildSDIVPow2() (which is the implementation) vs. DAGCombiner::BuildSDIVPow2() (which is the wrapper).

hermord updated this revision to Diff 166869. Sep 25 2018, 6:33 AM

@lebedev.ri Sorry, I couldn't find any instances of <4 x i16> in the nonsplat tests; I think I only had those in the splat ones. The patch no longer applied cleanly, so I rebased it again.

hermord added inline comments. Sep 25 2018, 6:38 AM
lib/CodeGen/SelectionDAG/TargetLowering.cpp
3721

Ah sorry, just to clarify, there used to be a DAGCombiner::BuildREMEqFold in an earlier version of this patch, but my rationale for removing it was that:

  1. We wanted to generate this fold when visiting the SETCC and not the UREM.
  2. DAGCombiner::visitSetCC delegates to DAGCombiner::SimplifySetCC, which is just a stub calling TargetLowering::SimplifySetCC.
  3. TargetLowering::SimplifySetCC adds nodes to the worklist directly and doesn't return an SDValue vector.

If it's better to mimic BuildSDIVPow2 and have a wrapper, I could:

  1. Have TargetLowering::SimplifySetCC call the wrapper from DAGCombiner, at the risk of this being slightly confusing (I don't believe anything else in that function does that?)
  2. Remove the call to TargetLowering::BuildUREMEqFold from TargetLowering::SimplifySetCC and add a call to the wrapper instead to DAGCombiner::SimplifySetCC or DAGCombiner::visitSETCC. Currently neither actually contains any case-specific simplification logic, so I'm not sure if I should change that.

@lebedev.ri Sorry, couldn't find any instances of 4 x i16 in the nonsplat tests; I think I only had those in the splat ones.

Hm, could be. I don't see them either now.

lib/CodeGen/SelectionDAG/TargetLowering.cpp
3721

The current BuildUREMEqFold() does not add nodes to the worklist directly, as far as I can see?

Well, I had the third variant in mind.

  1. Rename this function to something like BuildUREMEqFold_REAL() (bad name, just an example).
  2. Add a new wrapper BuildUREMEqFold() here in this file, right after BuildUREMEqFold_REAL().

So nothing changes for the callers.
And the wrapper will do all the work of receiving the vector of newly created SDNodes
and adding them to the worklist.

hermord updated this revision to Diff 167025. Sep 25 2018, 4:39 PM

Made a wrapper as discussed.

hermord added inline comments. Sep 25 2018, 4:52 PM
lib/CodeGen/SelectionDAG/TargetLowering.cpp
3721

It doesn't right now, this is true. I removed it following this comment:

Comment at: lib/CodeGen/SelectionDAG/TargetLowering.cpp:3771
+  SDValue Op1 = DAG.getNode(ISD::MUL, DL, REMVT, REMNode->getOperand(0), PVal);
+  DCI.AddToWorklist(Op1.getNode());
+  // Will change in the signed case
----------------
I do not know if this is needed?
At least rL338079 removed such a call.

Looking back at rL338079, it doesn't say that AddToWorklist is never necessary in this context, and similar code (BuildUDIV) does do it, so removing it might have been premature.

I took a stab at the third variant in the latest diff.

Any further blockers here?

RKSimon added inline comments. Oct 2 2018, 9:24 AM
test/CodeGen/AArch64/urem-seteq-vec-nonsplat.ll
6 ↗(On Diff #167025)

You can commit the nonsplat test name changes as an NFC now to reduce this patch.

test/CodeGen/X86/jump_sign.ll
410

Have you made any progress working out what the problem is with test1?

test/CodeGen/X86/urem-seteq-vec-nonsplat.ll
10 ↗(On Diff #167025)

You can commit the nonsplat test name changes as an NFC now to reduce this patch.

test/CodeGen/X86/urem-seteq-vec-splat.ll
66 ↗(On Diff #167025)

These <4 x i16> -> <8 x i16> test changes need to be done as an NFC commit, showing the current codegen, and then this patch rebased. It's up to you whether you keep the aarch64 tests using <4 x i16> or not, but the x86 versions need to be changed to a legal type.

xbolva00 removed a subscriber: xbolva00.

@hermord Are you still looking at this?

Btw, why don't we do this transformation at the IR level first?

spatel added a subscriber: spatel. Nov 24 2018, 7:43 AM
nikic added a subscriber: nikic. Nov 24 2018, 3:03 PM

@hermord Do you mind if I commandeer this to get it finished please?

@RKSimon I am sorry for the long radio silence. It is entirely my fault.

Regarding test1, the problem seems to be that once this fold is applied, producing something of the form (setule (mul t, -85), 85), it is difficult to deduce that it is the same as (seteq t, 0) in this case. With what was produced before, (seteq (sub t, (mul ...)), 0), a series of optimizations could figure out that the mul node is always zero.

This can be worked around in InstCombine without any additional KnownBits computations beyond what already happens. I'd looked into implementing a fix there before (mentioned in discussion with @lebedev.ri) and could replicate it, but this will require running opt on jump_sign.ll to restore the previous behavior.

I do not have commit rights, so I cannot act on the suggestion to commit the tests as NFC, but I can implement the above (or any other) approach for test1 now, if required, to finalize the patch. I feel this is very close to completion, but if that is not acceptable, please feel free to commandeer the patch.

Ok, so create a new revision with the tests? Somebody will commit it for you.

hermord updated this revision to Diff 180430. Jan 6 2019, 7:23 PM
hermord marked an inline comment as done.

Rebased on top of D56372.

Herald added a project: Restricted Project. Feb 7 2019, 4:26 AM
gnudles requested changes to this revision (edited). Feb 23 2019, 11:03 AM
gnudles added a subscriber: gnudles.

Hi guys, I found the magical formula for unsigned integers that also works with even numbers, without the need to check for overflow, for any remainder:
from a divisor d and remainder r, I calculate 4 constants.
Any d != 0 should work.

void calculate_constants64(uint64_t d, uint64_t r, uint64_t &k,uint64_t &mmi, uint64_t &s,uint64_t& u)
{
	k=__builtin_ctzll(d);/* the power of 2 */
	uint64_t d_odd=d>>k;
	mmi=find_mod_mul_inverse(d_odd,64);
	/* 64 is word size*/
	s=(r*mmi);
	u=(ULLONG_MAX-r)/d;/* note that I divide by d, not d_odd */
}

A little bit of background: the constant (u + 1) is the count of the values in the 64-bit range that yield the correct modulo.
The constant s should zero the first k bits when the given value has remainder r; it also cancels out the odd factor d_odd.

Then the compiler should generate the following code with the given constants:

int checkrem64(uint64_t k,uint64_t mmi, uint64_t s,uint64_t u,uint64_t x)
{
    uint64_t o=((x*mmi)-s);
    o= (o>>k)|(o<<(64-k));/*ROTR64(o,k)*/
    return o<=u;
}

This replaces the following:

/* d is the divisor, r is the remainder */
int checkrem64(uint64_t x)
{
  return x%d==r;
}

This is the code to find the modular inverse:

uint64_t find_mod_mul_inverse(uint64_t x, uint64_t bits)
{
      if (bits > 64 || ((x&1)==0))
              return 0;// invalid parameters
      uint64_t mask;
      if (bits == 64)
              mask = -1;
      else
      {                
              mask = 1;
              mask<<=bits;
              mask--;
      }
      x&=mask;
      uint64_t result=1, state=x, ctz=0;
      while(state!=1ULL)
      {
              ctz=__builtin_ctzll(state^1);
              result|=1ULL<<ctz;
              state+=x<<ctz;
              state&=mask;
      }
      return result;
}

Good luck!
I tested this on all cases of a 10-bit word size, and it passed.

*Edit:* I looked for something that would work for signed integers. I came up with something that would work with negative numbers if the following assumption were correct:

(-21)%10==9

but this assumption is not correct, because (-21)%10 equals -1 in C.
Anyway, the idea is like this: you shift the range and change s and u accordingly:

void calculate_constants64(uint64_t d, uint64_t r, uint64_t &k,uint64_t &mmi, uint64_t &s,uint64_t& u)
{
	k=__builtin_ctzll(d);/* the power of 2 */
	uint64_t d_odd=d>>k;
	mmi=find_mod_mul_inverse(d_odd,64);
	/* 64 is word size*/
     //this is the added line to make it work with signed integers
      r+=0x8000 0000 0000 0000% d;
	s=(r*mmi);
	u=(ULLONG_MAX-r)/d;
}

int checkrem64(uint64_t k,uint64_t mmi, uint64_t s,uint64_t u,uint64_t x)
{
     //this is the added line to make it work with signed integers
//x came as signed number but was casted to unsigned
    x^=0x8000 0000 0000 0000;// this is addition simplified to xor, spaces for clarity.
    uint64_t o=((x*mmi)-s);
    o= (o>>k)|(o<<(64-k));/*ROTR64(o,k)*/
    return o<=u;
}

but there must be a way to tweak u and s to make it work on negative-only or positive-only numbers... it should be easy...
Can someone please contact me to do the math?
I need background and an explanation of how the folding of unsigned numbers should act, and then I will try to find the formula for you... dvoreader@gmail.com
@jdoerfert, please read this carefully, because you did not calculate the Q constant correctly when D is even. That is why it did not work for you when D was 4, and it will also not work when D is 6 or 10, etc.

This revision now requires changes to proceed. Feb 23 2019, 11:03 AM
xbolva00 commandeered this revision. Jun 14 2019, 2:47 PM
xbolva00 updated this revision to Diff 204851.
xbolva00 added a reviewer: hermord.

Rebased. Moved to DAGCombiner and fixed regression in jump_sign.ll.

xbolva00 edited the summary of this revision. (Show Details) Jun 14 2019, 3:00 PM
xbolva00 planned changes to this revision. Jun 14 2019, 3:08 PM

Oh, no. Still there :((

xbolva00 abandoned this revision. Jun 14 2019, 3:27 PM

Hmm, I don't see the button to take over the review,
so I had to post a new one: D63391
The test/CodeGen/X86/jump_sign.ll regression is rather trivial, and is being fixed by a follow-up patch, D63390.