Page MenuHomePhabricator

PR41162 Implement LKK remainder and divisibility algorithms [urem]
Needs ReviewPublic

Authored by TG908 on Wed, Oct 2, 4:31 PM.

Details

Summary

This patch implements the LKK algorithm for optimizing remainder computations with a constant divisor.
LKK is an improvement on the previously used Granlund-Montgomery-Warren approach.

SREM Patch
PR41162
LKK Paper

Notes:

  • I added functions foldSREM and foldUREM in DAGCombiner.cpp to handle signed and unsigned remainders.
  • Tests have been performed on x86 for every 8 bit signed and unsigned remainder operation.
  • Tests for 32 bit signed remainders have been performed on a large set of random integers.
  • I also added some llc code gen tests.
  • I am a bit unsure whether my isOperationLegalOrCustomOrPromote checks are too strict.

Diff Detail

Event Timeline

TG908 created this revision.Wed, Oct 2, 4:31 PM
efriedma added inline comments.Wed, Oct 2, 5:04 PM
llvm/test/CodeGen/AArch64/srem-llk.ll
20 ↗(On Diff #222924)

This looks like it's actually more instructions than trunk.

jdoerfert resigned from this revision.Wed, Oct 2, 10:18 PM

Unsure how I ended up being a reviewer here in the first place.

Please precommit all new tests and rebase.

TG908 edited the summary of this revision. (Show Details)Thu, Oct 3, 12:42 AM
TG908 updated this revision to Diff 222976.Thu, Oct 3, 1:39 AM

put new tests in a separate commit

foad added inline comments.Thu, Oct 3, 2:27 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
3932

Is the "lower X%C to the equivalent of X-X/C*C" code below obsolete now, or is it still required in some cases?

3970

Why "same comparison result"? Do you mean "same result"?

4069

Ditto.

TG908 added inline comments.Thu, Oct 3, 2:51 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
3932

It is required when:

  • we have both a rem and a div node and we want to combine them.
  • for rem operations on integer types greater than half the max register width.
3970

yes

TG908 updated this revision to Diff 222985.Thu, Oct 3, 4:28 AM
  • put tests in separate commit
  • ran tests
  • fixed comment error
  • updated tests
TG908 marked 3 inline comments as done.Thu, Oct 3, 4:29 AM

put new tests in a separate commit

Well, New test files are still part of this revision. Please just land new test files and then rebase this patch so we can see codegen diff.

xbolva00 added inline comments.Thu, Oct 3, 4:33 AM
llvm/test/CodeGen/X86/urem-lkk.ll
4

Typo in name

  • put tests in separate commit

You need to generate diff as compared to that commit (which you should put into a new review), not as compared to master.

Can you please also pre commit the new tests like srem-lkk.ll urem-lkk.ll to PowerPC codegen test dir as well, so that we can see what is the impact to PowerPC? Thanks.

And maybe RISC V too..

  • put tests in separate commit

You need to generate diff as compared to that commit (which you should put into a new review), not as compared to master.

Almost there...

phab only shows the patch you uploaded, so you need to rebase the patch to actually show the changed tests

TG908 updated this revision to Diff 223409.Sun, Oct 6, 3:07 AM
TG908 added a comment.Sun, Oct 6, 3:19 AM

We've got some mixed results here:

    • urem seems to work quite well on X86. srem not so much.
    • for most vector operations on X86 and AArch64 the optimization seem beneficial
    • urem / srem on AArch64 come with quite some overhead for loading constants
  • for RISV only the non vector tests changed
  • for Power only the vector tests changed
TG908 added a comment.EditedSun, Oct 6, 3:28 AM

A C implementation of LKK suggests that it should be possible to generate better code on AArch64.

Edit: Bad ARM performance is due to the fact that loading a 64 Bit constant takes 4 instructions.

lebedev.ri added inline comments.Sun, Oct 6, 3:52 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
3919–3920
if (!TLI.isIntDivCheap(VT, Attr) && isConstantOrConstantVector(N1) && DAG.isKnownNeverZero(N1)) {
3923–3925

Not everything has div-rem operation, this check should only be done if said op exists.
That being said, i'm not sure this check should be here, DAGCombiner::visitSDIV() already
has // sdiv, srem -> sdivrem fold.

3974

This isn't folding, it's lowering.
This should be in TargetLowering.{h,cpp}

3983

So no 64-bit remainders on basically everything?

3998–3999

I think you want just isOperationLegalOrCustom().

TG908 marked 2 inline comments as done.Sun, Oct 6, 5:19 AM
TG908 added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
3983

According to the paper a good choice for the number of fraction bits would be twice the number of numerator bits. This limits us on 64bit platforms to 32bit rem only.

3998–3999

agree

TG908 marked an inline comment as done.Sun, Oct 6, 5:23 AM
TG908 added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
3974

Good point. I'll move it.

TG908 marked an inline comment as done.Sun, Oct 6, 5:27 AM
TG908 added inline comments.
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
3923–3925

This is also used to combine div and rem operations. This can be done with a divrem operation or by using the Granlund-Montgomery-Warren approach.

I do not follow the proposed lowering for urem.
Is this what's proposed: https://rise4fun.com/Alive/Y03 ?

I do not follow the proposed lowering for urem.
Is this what's proposed: https://rise4fun.com/Alive/Y03 ?

Ok, i see, nice: https://rise4fun.com/Alive/HiT

lebedev.ri added inline comments.Sun, Oct 6, 11:08 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
3974

While you are at it, please split srem part into a new patch.

TG908 updated this revision to Diff 223446.Sun, Oct 6, 3:43 PM
TG908 retitled this revision from PR41162 Implement LKK remainder and divisibility algorithms to PR41162 Implement LKK remainder and divisibility algorithms [UREM].Sun, Oct 6, 3:49 PM
TG908 edited the summary of this revision. (Show Details)
TG908 retitled this revision from PR41162 Implement LKK remainder and divisibility algorithms [UREM] to PR41162 Implement LKK remainder and divisibility algorithms [urem].

It looks that we now use hacker's deligth lowering for urem?
Where is that lowering being performed?

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
4934

This should be a simple add.

4941–4943

There is no such restriction.

4944–4945

I believe what you want to do, is to check whether *all* divisors are powers of two, and avoid *this* fold then.
If at least one of them is not a power of two this should still be good.
That being said many of the test changes look like regressions.

TG908 marked 2 inline comments as done.Mon, Oct 7, 3:18 PM
TG908 added inline comments.
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
4941–4943

What do you mean?
In my code?
In LKK?

4944–4945

Yeah.
I think the regressions came from changing the way I was checking if MUL is available.

lebedev.ri added inline comments.Mon, Oct 7, 3:20 PM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
4941–4943

I haven't read the paper, so i'm only looking at this code,
and they don't incur any restrictions on the divisor: https://rise4fun.com/Alive/HiT

TG908 marked 3 inline comments as done and an inline comment as not done.Mon, Oct 7, 3:22 PM
TG908 marked 2 inline comments as done and an inline comment as not done.Mon, Oct 7, 6:03 PM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
3945

Where is that lowering being performed?

Right here

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
4941–4943

Oops. You are right.

TG908 updated this revision to Diff 223824.Tue, Oct 8, 4:26 AM
TG908 added a comment.Tue, Oct 8, 4:30 AM

ARM performance could still be good in loops once the constant has been loaded.
Need to tests that.

TG908 added a comment.EditedWed, Oct 9, 6:13 AM

I tested loops containing a rem operation on AArch64.
With LKK the loop body contains 3 fewer instructions.

trunk
lkk

I will update the tests soon.

lenary removed a subscriber: lenary.Wed, Oct 9, 6:17 AM
TG908 marked 8 inline comments as done.Wed, Oct 9, 8:13 AM
TG908 marked an inline comment as done.Wed, Oct 9, 8:16 AM
TG908 marked an inline comment as done.Wed, Oct 9, 9:02 AM
TG908 added inline comments.
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
4922

This right here seems to fail on riscv64 +m with:

(lldb) p FVT

(llvm::EVT) $1 = {
  V = (SimpleTy = i64)
  LLVMTy = 0x0000000000000000
}
(lldb) p VT

(llvm::EVT) $2 = {
  V = (SimpleTy = i32)
  LLVMTy = 0x0000000000000000
}

Those types should be legal right? What am I missing?

(lldb) expr isTypeLegal(VT)
(bool) $5 = false
(lldb) expr isTypeLegal(FVT)
(bool) $6 = true
rogfer01 added inline comments.Wed, Oct 9, 12:37 PM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
4922

In riscv64 only i64 is legal because registers are 64-bit and instructions operate with all the bits of the GPRs.

In other words, the current assortment of instructions that would be useable to implement 32-bit operations (there are just a few of them) is not broad enough to warrant making i32 legal in RISC-V.

TG908 updated this revision to Diff 224324.Thu, Oct 10, 6:04 AM