This is an archive of the discontinued LLVM Phabricator instance.

PR41162 Implement LKK remainder and divisibility algorithms [urem]
Needs ReviewPublic

Authored by TG908 on Oct 2 2019, 4:31 PM.

Details

Summary

This patch implements the LKK algorithm for optimizing remainder computations with a constant divisor.
LKK is an improvement on the previously used Granlund-Montgomery-Warren approach.

SREM Patch
PR41162
LKK Paper

https://rise4fun.com/Alive/HiT

Notes:

  • I added functions foldSREM and foldUREM in DAGCombiner.cpp to handle signed and unsigned remainders.
  • Tests have been performed on x86 for every 8 bit signed and unsigned remainder operation.
  • Tests for 32 bit signed remainders have been performed on a large set of random integers.
  • I also added some llc code gen tests.
  • I am a bit unsure whether my isOperationLegalOrCustomOrPromote checks are too strict.

Diff Detail

Event Timeline

TG908 created this revision.Oct 2 2019, 4:31 PM
efriedma added inline comments.Oct 2 2019, 5:04 PM
llvm/test/CodeGen/AArch64/srem-llk.ll
20 ↗(On Diff #222924)

This looks like it's actually more instructions than trunk.

jdoerfert resigned from this revision.Oct 2 2019, 10:18 PM

Unsure how I ended up being a reviewer here in the first place.

Please precommit all new tests and rebase.

TG908 edited the summary of this revision. (Show Details)Oct 3 2019, 12:42 AM
TG908 updated this revision to Diff 222976.Oct 3 2019, 1:39 AM

put new tests in a separate commit

foad added inline comments.Oct 3 2019, 2:27 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
3970

Why "same comparison result"? Do you mean "same result"?

foad added inline comments.Oct 3 2019, 2:27 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
3932

Is the "lower X%C to the equivalent of X-X/C*C" code below obsolete now, or is it still required in some cases?

4069

Ditto.

TG908 added inline comments.Oct 3 2019, 2:51 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
3932

It is required when:

  • we have both a rem and a div node and we want to combine them.
  • for rem operations on integer types greater than half the max register width.
3970

yes

TG908 updated this revision to Diff 222985.Oct 3 2019, 4:28 AM
  • put tests in separate commit
  • ran tests
  • fixed comment error
  • updated tests
TG908 marked 3 inline comments as done.Oct 3 2019, 4:29 AM

put new tests in a separate commit

Well, New test files are still part of this revision. Please just land new test files and then rebase this patch so we can see codegen diff.

xbolva00 added inline comments.Oct 3 2019, 4:33 AM
llvm/test/CodeGen/X86/urem-lkk.ll
4

Typo in name

  • put tests in separate commit

You need to generate diff as compared to that commit (which you should put into a new review), not as compared to master.

Can you please also pre commit the new tests like srem-lkk.ll urem-lkk.ll to PowerPC codegen test dir as well, so that we can see what is the impact to PowerPC? Thanks.

And maybe RISC V too..

  • put tests in separate commit

You need to generate diff as compared to that commit (which you should put into a new review), not as compared to master.

Almost there...

phab only shows the patch you uploaded, so you need to rebase the patch to actually show the changed tests

TG908 updated this revision to Diff 223409.Oct 6 2019, 3:07 AM
TG908 added a comment.Oct 6 2019, 3:19 AM

We've got some mixed results here:

    • urem seems to work quite well on X86. srem not so much.
    • for most vector operations on X86 and AArch64 the optimization seem beneficial
    • urem / srem on AArch64 come with quite some overhead for loading constants
  • for RISV only the non vector tests changed
  • for Power only the vector tests changed
TG908 added a comment.EditedOct 6 2019, 3:28 AM

A C implementation of LKK suggests that it should be possible to generate better code on AArch64.

Edit: Bad ARM performance is due to the fact that loading a 64 Bit constant takes 4 instructions.

lebedev.ri added inline comments.Oct 6 2019, 3:52 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
3919–3920
if (!TLI.isIntDivCheap(VT, Attr) && isConstantOrConstantVector(N1) && DAG.isKnownNeverZero(N1)) {
3923–3925

Not everything has div-rem operation, this check should only be done if said op exists.
That being said, i'm not sure this check should be here, DAGCombiner::visitSDIV() already
has // sdiv, srem -> sdivrem fold.

3974

This isn't folding, it's lowering.
This should be in TargetLowering.{h,cpp}

3983

So no 64-bit remainders on basically everything?

3998–3999

I think you want just isOperationLegalOrCustom().

TG908 marked 2 inline comments as done.Oct 6 2019, 5:19 AM
TG908 added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
3983

According to the paper a good choice for the number of fraction bits would be twice the number of numerator bits. This limits us on 64bit platforms to 32bit rem only.

3998–3999

agree

TG908 marked an inline comment as done.Oct 6 2019, 5:23 AM
TG908 added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
3974

Good point. I'll move it.

TG908 marked an inline comment as done.Oct 6 2019, 5:27 AM
TG908 added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
3923–3925

This is also used to combine div and rem operations. This can be done with a divrem operation or by using the Granlund-Montgomery-Warren approach.

I do not follow the proposed lowering for urem.
Is this what's proposed: https://rise4fun.com/Alive/Y03 ?

I do not follow the proposed lowering for urem.
Is this what's proposed: https://rise4fun.com/Alive/Y03 ?

Ok, i see, nice: https://rise4fun.com/Alive/HiT

lebedev.ri added inline comments.Oct 6 2019, 11:08 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
3974

While you are at it, please split srem part into a new patch.

TG908 updated this revision to Diff 223446.Oct 6 2019, 3:43 PM
TG908 retitled this revision from PR41162 Implement LKK remainder and divisibility algorithms to PR41162 Implement LKK remainder and divisibility algorithms [UREM].Oct 6 2019, 3:49 PM
TG908 edited the summary of this revision. (Show Details)
TG908 retitled this revision from PR41162 Implement LKK remainder and divisibility algorithms [UREM] to PR41162 Implement LKK remainder and divisibility algorithms [urem].

It looks that we now use hacker's deligth lowering for urem?
Where is that lowering being performed?

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
4944

This should be a simple add.

4951–4953

There is no such restriction.

4954–4955

I believe what you want to do, is to check whether *all* divisors are powers of two, and avoid *this* fold then.
If at least one of them is not a power of two this should still be good.
That being said many of the test changes look like regressions.

TG908 marked 2 inline comments as done.Oct 7 2019, 3:18 PM
TG908 added inline comments.
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
4951–4953

What do you mean?
In my code?
In LKK?

4954–4955

Yeah.
I think the regressions came from changing the way I was checking if MUL is available.

lebedev.ri added inline comments.Oct 7 2019, 3:20 PM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
4951–4953

I haven't read the paper, so i'm only looking at this code,
and they don't incur any restrictions on the divisor: https://rise4fun.com/Alive/HiT

TG908 marked 3 inline comments as done and an inline comment as not done.Oct 7 2019, 3:22 PM
TG908 marked 2 inline comments as done and an inline comment as not done.Oct 7 2019, 6:03 PM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
3945

Where is that lowering being performed?

Right here

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
4951–4953

Oops. You are right.

TG908 updated this revision to Diff 223824.Oct 8 2019, 4:26 AM
TG908 added a comment.Oct 8 2019, 4:30 AM

ARM performance could still be good in loops once the constant has been loaded.
Need to tests that.

TG908 added a comment.EditedOct 9 2019, 6:13 AM

I tested loops containing a rem operation on AArch64.
With LKK the loop body contains 3 fewer instructions.

trunk
lkk

I will update the tests soon.

lenary removed a subscriber: lenary.Oct 9 2019, 6:17 AM
TG908 marked 8 inline comments as done.Oct 9 2019, 8:13 AM
TG908 marked an inline comment as done.Oct 9 2019, 8:16 AM
TG908 marked an inline comment as done.Oct 9 2019, 9:02 AM
TG908 added inline comments.
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
4932

This right here seems to fail on riscv64 +m with:

(lldb) p FVT

(llvm::EVT) $1 = {
  V = (SimpleTy = i64)
  LLVMTy = 0x0000000000000000
}
(lldb) p VT

(llvm::EVT) $2 = {
  V = (SimpleTy = i32)
  LLVMTy = 0x0000000000000000
}

Those types should be legal right? What am I missing?

(lldb) expr isTypeLegal(VT)
(bool) $5 = false
(lldb) expr isTypeLegal(FVT)
(bool) $6 = true
rogfer01 added inline comments.Oct 9 2019, 12:37 PM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
4932

In riscv64 only i64 is legal because registers are 64-bit and instructions operate with all the bits of the GPRs.

In other words, the current assortment of instructions that would be useable to implement 32-bit operations (there are just a few of them) is not broad enough to warrant making i32 legal in RISC-V.

TG908 updated this revision to Diff 224324.Oct 10 2019, 6:04 AM
nikic resigned from this revision.Oct 22 2019, 1:50 PM
xbolva00 added inline comments.Oct 29 2019, 7:58 AM
llvm/test/CodeGen/X86/vector-width-store-merge.ll
1 ↗(On Diff #224324)

Not related to this patch. Revert this change from this patch or commit it as NFC patch.

Same above in some cases.

TG908 updated this revision to Diff 227133.Oct 30 2019, 10:08 AM
TG908 marked an inline comment as done.Oct 30 2019, 10:11 AM
TG908 added inline comments.
llvm/test/CodeGen/X86/vector-width-store-merge.ll
1 ↗(On Diff #224324)

for this file I think it is. For the others you are right,

TG908 added a comment.EditedOct 31 2019, 5:20 AM

I am still looking for a better solution for checking if the types are legal and the optimization can be applied.
Right now I get an integer type VT for the rem operation and I check if a type twice as wide (FVT) is legal.

// Check to see if we can do this.
 if (IsAfterLegalization && !isTypeLegal(FVT))
   return SDValue();

On RISCV the first iteration of a 32 bit rem operation will not cause an optimization. After the next round of legalization the operations gets expanded to 64 bit and the the optimization gets applied.
Checking for both VT and FVT at the same time doesn't work since RISCV only has i64 and not i32. Thus VT and FVT are never legal at the same time.

Can I leave it like this and rely on the legalization phase to take care of it or should I handle the case explicitly where VT and FVT are not legal at the same time by expanding VT to the size of FVT manually?

 // Check to see if we can do this.
  if (IsAfterLegalization && !isTypeLegal(FVT)) {
    return SDValue();
} else if isTypeLegal(FVT) {
   VT = FVT;
   ...
}
...
TG908 added a comment.Nov 1 2019, 4:24 PM
This comment was removed by TG908.
TG908 marked an inline comment as done.Nov 1 2019, 4:30 PM
TG908 added inline comments.
llvm/test/CodeGen/X86/vector-width-store-merge.ll
1 ↗(On Diff #224324)

never mind

TG908 updated this revision to Diff 227576.Nov 2 2019, 7:50 AM
xbolva00 added inline comments.Nov 2 2019, 8:14 AM
llvm/test/CodeGen/X86/urem-lkk.ll
110

This loop test we dont need I think.

If you want to leave it, please regenerate CHECKs.

llvm/test/CodeGen/X86/vector-idiv-udiv-256.ll
577

Not sure if this avx512’s code is a improvement.

@craig.topper ?

TG908 marked an inline comment as done.Nov 2 2019, 8:21 AM
TG908 added inline comments.
llvm/test/CodeGen/X86/urem-lkk.ll
110

The loop test was meant to demonstrate that lkk is beneficial on aarch64 even though one single rem operation emits more instructions than the old approach. In a loop the loading of constants only happens once so there overall performance with lkk is better than without. I will remove this since this was only a test.

Still some regressions.

llvm/test/CodeGen/X86/urem-vector-lkk.ll
55

this is worse

llvm/test/CodeGen/X86/vector-idiv-udiv-128.ll
629

regression

lebedev.ri edited the summary of this revision. (Show Details)Nov 24 2019, 7:22 AM
lebedev.ri resigned from this revision.Jun 12 2020, 5:27 AM
RKSimon added inline comments.Jun 17 2020, 1:05 AM
llvm/test/CodeGen/X86/vector-idiv-udiv-256.ll
577

vpmulld/vpmuludq (vXi32 mul ops) are notably slower than vpmullw/vpmulhuw (vXi16) - we need to avoid this.

RKSimon added inline comments.Jun 18 2020, 1:49 AM
llvm/include/llvm/CodeGen/TargetLowering.h
3976

Add a BuildUREM specific comment and give BuildSDIVPow2 its comment back

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
3919–3920

+1 Avoid (more costly) known bits analysis by using an early out on cheaper checks.

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
4925

Can we call this something more descriptive than just 'F'?

4951

Maybe put the assert straight after the 'const APInt &D = DivisorConstant->getAPIntValue();' line?

4964

superfluous comment?

4968

superfluous comment?

dexonsmith resigned from this revision.Oct 16 2020, 12:32 PM
RKSimon added a subscriber: nagisa.

Adding @nagisa who has been working on D88785 recently

I don't know if @TG908 is still interested in developing this, else whether anyone is interested in commandeering it.