
Support the division-by-constant strength reduction for more integer types
Needs Review · Public

Authored by nagisa on Sep 19 2020, 3:05 PM.

Details

Summary

The division-by-constant strength reduction into a multiply-shift sequence of instructions can be
applied on ~all targets at any integer width to gain a significant throughput boost for the operation,
at a (fairly significant) cost in code size.
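
As an illustration of the reduction itself (a hand-worked example at a natively supported width,
not LLVM's output verbatim), an unsigned division by 5 becomes a widened multiply followed by a
shift:

define i32 @udiv5(i32 %x) {
  %wide = zext i32 %x to i64
  %mul = mul i64 %wide, 3435973837   ; 0xCCCCCCCD = ceil(2^34 / 5)
  %shifted = lshr i64 %mul, 34
  %q = trunc i64 %shifted to i32     ; equals %x udiv 5 for every 32-bit %x
  ret i32 %q
}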

LLVM already has this optimisation, but it would only fire on integers with natively supported
bit-widths. For example, on x86_64 divisions up to 64 bits wide would trigger the
optimisation, whereas on i686 64-bit integers would not be strength-reduced.

This commit adjusts the lowering code to apply this strength-reduction even on integer bit-widths
not natively supported by the target. Ideally this would've been implemented via fallback lowerings
for ISD::MULHU and ISD::MULHS – not all backends support them – but I found that to require
significant refactoring, and it still failed to work on some backends, such as ARM or
RISC-V (without the M extension), regardless.

However, targets universally support ISD::MUL at any bit width, so we simply take the upper
half of a double-width MUL result. This will likely be sub-optimal in the sense that some of the
emitted instructions do no useful work, but even with those instructions present the resulting
lowering should still be significantly better than conventional software division implementations.
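
Where ISD::MULHU is unavailable, the patch in effect relies on the following pattern (a minimal
sketch at 64 bits; the function name is mine, not the patch's):

define i64 @mulhu64(i64 %a, i64 %b) {
  %wa = zext i64 %a to i128
  %wb = zext i64 %b to i128
  %wide = mul i128 %wa, %wb    ; a plain ISD::MUL at double the width
  %hi = lshr i128 %wide, 64
  %r = trunc i128 %hi to i64   ; the ISD::MULHU result
  ret i64 %r
}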

Depends on D88785


Event Timeline

nagisa created this revision. Sep 19 2020, 3:05 PM

Still working on adding the tests, but I believe this is “done” otherwise.

nagisa edited the summary of this revision. Sep 19 2020, 3:08 PM
nagisa updated this revision to Diff 293031. Sep 20 2020, 1:52 PM

Update the pre-existing tests

nagisa updated this revision to Diff 294528. Sep 26 2020, 5:43 PM

Allow non-legal shift types too

nagisa published this revision for review. Sep 26 2020, 5:43 PM
nagisa updated this revision to Diff 294529. Sep 26 2020, 6:17 PM

Don't do any of this after legalization has already happened

craig.topper added inline comments. Sep 26 2020, 8:57 PM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
4977

IsAfterLegalization refers to LegalOperations not LegalTypes.

4980–4981

I believe IsAfterLegalization refers to LegalOperations rather than LegalTypes.

But if we get here after type legalization, then the VT must be legal or DAGCombiner would never have seen it and called this, so we might just be able to remove this whole check.

5049

This needs to be a check for LegalTypes not LegalOperations.

llvm/test/CodeGen/X86/vshli-simplify-demanded-bits.ll
11–25

Why did the amount of code increase here? This is a legal type, so why was it affected?

craig.topper added inline comments. Sep 26 2020, 8:59 PM
llvm/test/CodeGen/X86/vshli-simplify-demanded-bits.ll
11–25

Oh I misread the arguments. One is a legal type but the other isn't.

nikic edited reviewers, added: efriedma; removed: eli.friedman. Sep 27 2020, 1:08 AM
nagisa added inline comments. Sep 27 2020, 5:04 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
4977

Any suggestions on how to best approach obtaining the information? Should I just pass in another boolean as an argument from DAGCombiner?

4980–4981

Yeah, I had this removed entirely in an earlier revision and it worked just fine. I added it back, reasoning that there might be some weird corner case I'm not aware of where the “just do it” approach wouldn't be correct.

Adding @nhaehnle, who tried something similar in D24822

I'm skeptical this is a good idea when the division is wider than the widest legal mulhi. You end up generating either a ton of inline code, or a libcall; the result might not be faster than the original divide libcall.

The improvements along the lines of llvm/test/CodeGen/AArch64/srem-seteq.ll are obviously profitable.

Also, I think I remember some discussion that the compiler-rt implementations of division on x86 have performance issues.

I'm skeptical this is a good idea when the division is wider than the widest legal mulhi. You end up generating either a ton of inline code, or a libcall; the result might not be faster than the original divide libcall.

LLVM (with this change) is definitely going to generate a ton of inline code. For instance, a single function containing an sdiv i1024 generates a whopping 2238 lines of x86_64 assembly. However, it does manage to generate assembly at all; LLVM could not compile this before, because it knows of no division libcalls wider than 128 bits. So that's already somewhat of an improvement. Nor is this a unique problem: legalization of other super-wide operations, even things as simple as add or various shifts, expands to large amounts of code too.

You are also right that with this change LLVM may generate multiplication libcalls (such as when dividing i128 integers). Multiplications are comparatively easy to implement efficiently in software, where they become a tree of smaller multiplications added together. So the generation of multiplication libcalls is perhaps even desirable, as it somewhat reduces the amount of code generated inline! You can't do anything similar for division.
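
For example (a hand-written sketch, not compiler-rt's actual implementation), a 128x128->128
multiply decomposes into three 64x64 partial products; the fourth would overflow out of the
result entirely:

define i128 @mul128(i128 %a, i128 %b) {
  %alo = and i128 %a, 18446744073709551615   ; low 64 bits of %a
  %ahi = lshr i128 %a, 64                    ; high 64 bits of %a
  %blo = and i128 %b, 18446744073709551615
  %bhi = lshr i128 %b, 64
  %ll = mul i128 %alo, %blo                  ; each mul needs only 64x64 significant bits
  %lh = mul i128 %alo, %bhi
  %hl = mul i128 %ahi, %blo
  %cross = add i128 %lh, %hl
  %crosshi = shl i128 %cross, 64             ; cross terms land in the high half
  %r = add i128 %ll, %crosshi
  ret i128 %r
}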

Ultimately, however, i64 and i128 operations are all that matter in practice. So here is a quick comparison I could put together in the short amount of time I had for this comment:

                            div RThroughput    This expansion RThroughput
i686 core2: i64 / 42        18-37[^2]          13.8[^1]
i686 core2: i128 / 42       libcall (?)        61.0[^1]
x86_64 znver2: i128 / 42    13-44[^3]          8.0[^4]

[^1]: Calculated by llvm-mca -mcpu=core2 -mtriple=i686
[^2]: Taken from Agner's instruction tables for “Intel Core 2 (Merom, 65nm)”.
[^3]: Taken from Agner's instruction tables for “AMD Zen 2” (used zen instead of skylake, because zen's native instruction throughput is better).
[^4]: Calculated by llvm-mca -mcpu=znver2 -mtriple=x86_64

In each of these instances the strength-reduced operation has better throughput than the best case of a native (although most likely micro-coded) division instruction. I strongly doubt a software implementation of division could do any better than such a native instruction, although I could see the scales tipping on targets that have no native multiplication instructions.

Finally, isIntDivCheap exists and should allow targets to prevent this optimisation where it makes sense for them?


For these computations I used the following snippet of code (and its equivalent with s/i128/i64/ for i686):

define dso_local i128 @foo(i128 %x) local_unnamed_addr {
entry:
  %d = udiv i128 %x, 42
  ret i128 %d
}

Also, I think I remember some discussion that the compiler-rt implementations of division on x86 have performance issues.

We recently got heavily optimised software division implementations in Rust's compiler-builtins. I could compare against those as well, but many of them are very architecture-specific, and I don't have good means for cycle-accurate measurements outside of x86_64.

Please can you rebase?

nagisa updated this revision to Diff 294647. Sep 28 2020, 4:15 AM

Gate lowering on whether we're after type legalization & rebase

RKSimon added inline comments. Sep 28 2020, 4:17 AM
llvm/test/CodeGen/X86/i128-sdiv.ll
3

Can you add this coverage back?

nagisa updated this revision to Diff 294649. Sep 28 2020, 4:19 AM

redo formatting

nagisa updated this revision to Diff 294650. Sep 28 2020, 4:21 AM

restore i128-sdiv RUN lines

nagisa marked 4 inline comments as done. Sep 28 2020, 4:22 AM
nagisa added inline comments.
llvm/test/CodeGen/X86/i128-sdiv.ll
3

Huh, not sure how it got removed.

nagisa updated this revision to Diff 294651. Sep 28 2020, 4:23 AM
nagisa marked an inline comment as done.

Restore i128-sdiv further


In each of these instances the strength-reduced operation has better throughput than the best case of a native (although most likely micro-coded) division instruction. I strongly doubt a software implementation of division could do any better than such a native instruction, although I could see the scales tipping on targets that have no native multiplication instructions.

There are 2 issues here that x86 in particular avoids:

  1. Some targets don't really have a multiplier, like certain RISC-V variants. Or some targets have a multiply instruction that's hard to use here, like Cortex-M0.
  2. The libcall is doing a ton of extra work to produce the result of an NxN->N multiply. We don't have the libcall variant we want here.

However, it does manage to generate assembly at all; LLVM could not compile this before, because it knows of no division libcalls wider than 128 bits. So that's already somewhat of an improvement.

If we really cared about this, we could emit a general-purpose implementation inline. We don't do this because that isn't what anyone wants anyway.

llvm/test/CodeGen/AArch64/urem-seteq-nonzero.ll
207

Any idea what's going on here?

nagisa added inline comments. Oct 3 2020, 9:49 AM
llvm/test/CodeGen/AArch64/urem-seteq-nonzero.ll
207

Good catch.

With the old code the urem gets promoted to i32 first, which AFAICT then lets some other validity check in SimplifySetCC pass (one which doesn't otherwise succeed for i16/i8), and thus lets it apply

(seteq/ne (urem N, D), 0) -> (setule/ugt (rotr (mul N, P), K), Q)

anyway. With the adjusted BuildUDIV, the multiply-shift reduction gets applied before the urem i16 is promoted to urem i32.
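
Concretely, for a (urem i32 %x, 3) == 0 check the fold produces the equivalent of the following
(constants hand-computed for this comment; the rotr vanishes because 3 is odd, so K = 0):

define i1 @urem3_is_zero(i32 %x) {
  %mul = mul i32 %x, 2863311531          ; 0xAAAAAAAB, the inverse of 3 mod 2^32
  %cmp = icmp ule i32 %mul, 1431655765   ; 0x55555555 = (2^32 - 1) / 3
  ret i1 %cmp
}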

nagisa updated this revision to Diff 296046. Oct 4 2020, 6:34 AM

Rebase on top of D88785

nagisa edited the summary of this revision. Oct 4 2020, 6:34 AM
nagisa updated this revision to Diff 296047. Oct 4 2020, 7:09 AM

Pass around DAGCombinerInfo instead of booleans

llvm/test/CodeGen/RISCV/srem-lkk.ll