This is an archive of the discontinued LLVM Phabricator instance.

Improve the legalisation lowering of UMULO
ClosedPublic

Authored by nagisa on Aug 5 2018, 8:00 AM.

Details

Summary
There is no way in the universe that doing a full-width division in
software will be faster than doing the overflowing multiplication in
software in the first place, especially given that the same full-width
multiplication needs to be done anyway.

This patch replaces the previous implementation with a direct lowering
into an overflowing multiplication algorithm based on half-width
operations.
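
For reference, the identity the half-width expansion relies on, written in my
own notation (h is the half bit-width, .hi and .lo are the two halves of an
operand):

    a*b = (a.hi*b.hi << 2h) + ((a.hi*b.lo + a.lo*b.hi) << h) + a.lo*b.lo

The low 2h bits of the result come from the last two terms; the multiplication
overflows exactly when a.hi*b.hi is non-zero, one of the cross products
a.hi*b.lo or a.lo*b.hi does not fit into h bits, or the final addition of the
shifted cross terms and a.lo*b.lo wraps. These are precisely the conditions the
expansion ORs together.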

Correctness of the algorithm was verified by exhaustively checking the
output of this algorithm for overflowing multiplication of 16 bit
integers against an obviously correct widening multiplication. Barring
any oversights introduced by porting the algorithm to the DAG, confidence
in the correctness of this algorithm is extremely high.

The following table shows the change in both runtime (t) and code size (s).
The change is expressed as a multiple of the original, so anything below 1 is
“better” and anything above 1 is worse.

+-------+-----------+-----------+-------------+-------------+
| Arch  | u64*u64 t | u64*u64 s | u128*u128 t | u128*u128 s |
+-------+-----------+-----------+-------------+-------------+
|   X64 |     -     |     -     |    ~0.5     |    ~0.64    |
|  i686 |   ~0.5    |   ~0.6666 |    ~0.05    |    ~0.9     |
| armv7 |     -     |   ~0.75   |      -      |    ~1.4     |
+-------+-----------+-----------+-------------+-------------+

Performance numbers have been collected by running overflowing
multiplication in a loop under `perf` on two x86_64 machines (one Intel
Haswell, the other AMD Ryzen). Size numbers have been collected by
looking at the size of the function containing an overflowing multiply
in a loop.

All in all, it can be seen that both performance and size have improved,
except on armv7, where code size has regressed for the 128-bit multiply.
The u128*u128 overflowing multiply on 32-bit platforms seems to benefit
from this change a lot, taking only about 5% of the time the original
algorithm needed to calculate the same thing.

The final benefit of this change is that LLVM is now capable of lowering
the overflowing unsigned multiply for integers of any bit-width as long
as the target is capable of lowering regular multiplication for the same
bit-width. Previously, 128-bit overflowing multiply was the widest
possible.
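
For illustration, a checked multiplication like the following now benefits
from this lowering (a Rust sketch; the function name is mine, and the
assumption is that overflowing_mul is emitted as llvm.umul.with.overflow.i128):

// Sketch only: assumes overflowing_mul lowers through
// llvm.umul.with.overflow.i128. With this patch the overflow bit is derived
// from half-width multiplies rather than the previous full-width division,
// which is where the u128*u128 improvements in the table above come from.
pub fn checked_mul_u128(a: u128, b: u128) -> Option<u128> {
    let (product, overflowed) = a.overflowing_mul(b);
    if overflowed { None } else { Some(product) }
}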

Notes:

  • This change might have broken some tests I have not caught. I have no idea what tests are present and how to run them all, so I’ll leave it up to CI to build and run the tests.
    • ninja check-all seems to pass locally, but 1) I don’t have all targets enabled; and 2) some of my previous revisions have failed tests in CI even when I had all targets enabled…
  • I have no idea how style is enforced in LLVM, so I tried my best to match the surrounding code by hand;
  • I have no idea who the reviewers should be, so I just picked Eric, who seems to have introduced this code in the first place;
  • I do not have commit access, so somebody will have to land this for me.

Diff Detail

Repository
rL LLVM

Event Timeline

nagisa created this revision. Aug 5 2018, 8:00 AM
nagisa edited the summary of this revision. Aug 5 2018, 8:02 AM
rkruppe added a subscriber: rkruppe. Aug 5 2018, 9:09 AM
nagisa updated this revision to Diff 159383. Aug 6 2018, 1:51 PM

I’ve added some codegen tests. These mostly just ensure that targets are able to lower this operation, although I did manually verify the x64 assembly to check that it really looks like what I expect.

I would very much love to add a test that actually executes some machine code with a number of test vectors, but I believe no such test family exists in LLVM.

I would very much love to add a test that actually executes some machine code with a number of test vectors, but I believe no such test family exists in LLVM.

That would require being able to execute code for an arbitrary target, which would in general require an emulator, which is a lot more hassle than it's worth.


No tests for umul.with.overflow.i64 on 32-bit?

lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
2743 ↗(On Diff #159212)

Transforming umul_lohi to a call to __multi3 isn't particularly useful. We could expand it inline, but it's a lot of code.

2717 ↗(On Diff #159383)

This add can overflow, but I guess that can only happen if %0 is true?

2754 ↗(On Diff #159383)

Maybe BUILD_PAIR here instead, since it's going to get split anyway?

test/CodeGen/ARM/muloti2.ll
3 ↗(On Diff #159383)

v6 and v7 are basically identical for this purpose; the code isn't using any v7-specific instructions.

4 ↗(On Diff #159383)

AArch64 tests aren't allowed in test/CodeGen/ARM; they have to go in test/CodeGen/AArch64. (It's possible to build LLVM with the AArch64 backend enabled, but not the ARM backend.)

Probably should also test ARMv6 in Thumb mode for completeness (although I expect the result to be messy).

nagisa added a comment. Aug 7 2018, 3:31 AM

No tests for umul.with.overflow.i64 on 32-bit?

Thanks for the review. On one hand, I ended up forgetting because it was getting late; on the other hand, that case is already exercised by the i128 umulo lowering, which uses i64 umulo in its implementation. I will add a few tests targeting i64 umulo specifically later today.

nagisa added inline comments. Aug 7 2018, 3:40 AM
lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
2743 ↗(On Diff #159212)

The intention behind using umul_lohi was specifically to give targets the information that a widening multiply is expected here, so that targets which natively support this operation could use it without having to inspect the operands for ZERO_EXTEND.

Alas, targets like 32-bit ARM outright refuse to lower such an operation for e.g. an i64,i64 output, and, I assume, many more targets would have trouble with an i128,i128 umul_lohi.

2717 ↗(On Diff #159383)

When this operation overflows, one or more of %1.1, %2.1 and %0 will already be true, and thus the whole operation will already have the overflow bit set. More specifically, this add can only wrap when both of the shifted cross products are non-zero, which already implies that both high halves are non-zero and therefore that %0 is true.

2754 ↗(On Diff #159383)

Will try, but not sure if you can add two "pairs" together.

nagisa updated this revision to Diff 159560. Aug 7 2018, 11:32 AM

Changed code to use BUILD_PAIR;
Added the tests for i64 umulo;
Moved the AArch64 tests to the right places;
Added a Thumbv6 i128 umulo test.

Please choose a different name for the tests; "muloti2.ll" doesn't usefully indicate what the files actually test.

Correctness of the algorithm was verified by exhaustively checking the output of this algorithm for overflowing multiplication of 16 bit integers against an obviously correct widening multiplication.

What exactly did you verify? Looking again, I'm pretty sure your example IR doesn't compute the correct result: you never compute the product %LHS.HI * %RHS.HI.

test/CodeGen/X86/muloti.ll
60 ↗(On Diff #159560)

This test is pretty useless; could get rid of it, I guess, since it's covered by muloti2.ll.

nagisa added a comment. Aug 8 2018, 1:42 PM

Please choose a different name for the tests; "muloti2.ll" doesn't usefully indicate what the files actually test.

Does something like umulo-legalisation-lowering.ll sound good?

Correctness of the algorithm was verified by exhaustively checking the output of this algorithm for overflowing multiplication of 16 bit integers against an obviously correct widening multiplication.

What exactly did you verify? Looking again, I'm pretty sure your example IR doesn't compute the correct result: you never compute the product %LHS.HI * %RHS.HI.

The computation of %LHS.HI * %RHS.HI is only necessary to compute the overflow bit. If multiplication were used, the check would look roughly like this (pseudo-IR, with iNh standing in for the half-width integer type):

%product = call { iNh, i1 } @llvm.umul.with.overflow.iNh(iNh %LHS.HI, iNh %RHS.HI)
%0 = %product.0 != 0 || %product.1 ; equivalent of the current %0

The %0 = %LHS.HI != 0 && %RHS.HI != 0 check is an optimisation that calculates the same information without doing the multiply. Computing the product may be more efficient for some specific bit widths on some targets, but I found the comparison variant to be more palatable in the general case.
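
To make the equivalence concrete, here is a small sketch using u8 halves, as in the exhaustive test further down (the helper names are made up for illustration):

pub fn hi_overflow_via_mul(lhs_hi: u8, rhs_hi: u8) -> bool {
    // The hi*hi partial product lands entirely above the result width, so the
    // full multiplication overflows if this narrow product is non-zero or the
    // narrow multiply itself overflows.
    let (product, overflowed) = lhs_hi.overflowing_mul(rhs_hi);
    product != 0 || overflowed
}

pub fn hi_overflow_via_cmp(lhs_hi: u8, rhs_hi: u8) -> bool {
    // The variant used by the patch: a product is non-zero exactly when both
    // factors are non-zero, so the multiply can be skipped entirely.
    lhs_hi != 0 && rhs_hi != 0
}

The two functions agree for every pair of inputs; the exhaustive check below exercises the comparison variant as part of the full algorithm.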


The following Rust code is what I used to check the correctness of the current algorithm exhaustively within a reasonable time frame. I’m willing to port this test to C if there’s a demand for it.

type Half = u8;
type Full = u16;
type Double = u32;

const HALF_BITS: u32 = 8;
const FULL_BITS: u32 = 16;

pub fn obviously_correct(l: Full, r: Full) -> (Full, bool) {
    // Do a widening multiplication and check the high half to see if the multiplication
    // overflowed. Also correctly handles result wrapping in case of overflow.
    let doublewide = (l as Double).wrapping_mul(r as Double);
    (doublewide as Full, (doublewide >> FULL_BITS) != 0)
}

pub fn actual_implementation(l: Full, r: Full) -> (Full, bool) {
    // Split both operands into their high and low halves.
    let (lhs_lo, rhs_lo) = (l as Half, r as Half);
    let (lhs_hi, rhs_hi) = ((l >> HALF_BITS) as Half, (r >> HALF_BITS) as Half);

    // The hi*hi partial product lands entirely above the result width, so the
    // multiplication overflows whenever both high halves are non-zero.
    let overflow0 = lhs_hi != 0 && rhs_hi != 0;
    // Cross products; each one overflows the result if it does not fit into a half.
    let (r1, overflow1) = lhs_hi.overflowing_mul(rhs_lo);
    let (r2, overflow2) = rhs_hi.overflowing_mul(lhs_lo);
    // lo*lo always fits into a full-width integer.
    let r3 = (lhs_lo as Full).wrapping_mul(rhs_lo as Full);
    // Shift the cross products into place and accumulate. The wrapping_add can
    // only wrap when overflow0 is already set, so its wrap needs no tracking.
    let r4 = ((r1 as Full) << HALF_BITS).wrapping_add((r2 as Full) << HALF_BITS);
    let (r5, overflow5) = r4.overflowing_add(r3);
    (r5, overflow0 || overflow1 || overflow2 || overflow5)
}

pub fn main() {
    for lhs in Full::min_value()..=Full::max_value() {
        for rhs in Full::min_value()..=Full::max_value() {
            assert_eq!(obviously_correct(lhs, rhs), actual_implementation(lhs, rhs),
                       "results did not match for lhs={}, rhs={}", lhs, rhs);
        }
    }
}

The computation of %LHS.HI * %RHS.HI is only necessary to compute the overflow bit.

Oh, sorry, you're right, not sure what I was thinking.

I was reading the AArch64 code and thinking it looked strange, but the issue was just that the code was doing the operations in a strange order. An i128 multiply normally generates umulh+madd+madd, but for some reason your expansion generates mul+umulh+madd+add. Not really important.

nagisa updated this revision to Diff 160130. Aug 10 2018, 10:04 AM

Updated test filenames to better reflect what they are testing.

efriedma accepted this revision. Aug 13 2018, 4:57 PM

LGTM. (Do you want me to commit this for you?)

This revision is now accepted and ready to land. Aug 13 2018, 4:57 PM

Do you want me to commit this for you?

Yes, please. Thanks!

This revision was automatically updated to reflect the committed changes.