This is an archive of the discontinued LLVM Phabricator instance.

[mlir][arith] Support wide integer multiplication emulation
ClosedPublic

Authored by kuhar on Sep 9 2022, 9:12 PM.

Details

Summary

Emulate multiplication by splitting each input element of type i2N into 4
digits of type iN and bit width i(N/2). This is so that the intermediate
multiplications and additions do not overflow. We extract these i(N/2)
digits from iN vector elements by masking (low digit) and shifting right
(high digit).

The multiplication algorithm used is the standard (long) multiplication.
Multiplying two i2N integers produces (at most) a i4N result, but because
the calculation of top i2N is not necessary, we omit it.
In total, this implementations performs 10 intermediate multiplications
and 16 additions. The number of multiplications could be decreased by
switching to a more efficient algorithm like Karatsuba. This would,
however, require being able to perform (intermediate) wide additions and
subtractions, so it is not clear that such implementation would be more
efficient.

I tested this on all 16-bit inut pairs, when emulating i16 with i8.

Diff Detail

Event Timeline

kuhar created this revision.Sep 9 2022, 9:12 PM
Herald added a project: Restricted Project. · View Herald Transcript
kuhar requested review of this revision.Sep 9 2022, 9:12 PM
kuhar updated this revision to Diff 459250.Sep 9 2022, 9:23 PM

Disallow emulation with i1 element type, which would complicate multiplication.

kuhar updated this revision to Diff 459463.Sep 12 2022, 8:02 AM

Fix low digit mask generation for very wide integer types.

Mogball added inline comments.Sep 13 2022, 9:43 AM
mlir/lib/Dialect/Arithmetic/Transforms/EmulateWideInt.cpp
56
71
mlir/test/Dialect/Arithmetic/emulate-wide-int.mlir
221

it would be nice to have "integration tests" for some of these expansions. Verifying the generated code for correctness is kind of difficult if it's this long. Having some cases with mlir-cpu-runner that produce the expected output would boost confidence that the implementation is correct (instead of me having to comb through all this code :( )

kuhar added inline comments.Sep 13 2022, 6:32 PM
mlir/test/Dialect/Arithmetic/emulate-wide-int.mlir
221

That's a very good point and something I had a couple of attempts at. Coming up with these emulation patterns is not trivial to me, which made me consider a few approaches to gain confidence in the implementation:

  1. Convert to the LLVM Dialect and then to LLVM IR. Verify that the emulated op matches the wide one, using Alive 2. Unfortunately, this did not scale for me and I would get timeouts when trying to validate emulated addi (which was much simpler than muli or shrui!)
  1. Hand-pick a few corner cases, provide input as constants, and rely on the constant folding code to get the final result as constants, which could be checked with LIT. I run into a few issues with missing folds, but even if we add anything that's missing, this still seems fragile to me. This could effectively force the constant folding code to support some CPU-intensive folds that are not useful otherwise.
  1. Add runtime integration tests in IREE, based on iree-run-module. I discarded this idea because it would force me add memref support earlier than I would like, and either require some cherry picks for our llvm-project fork, or waiting until llvm-project phabricator patches have landed. And ultimately, this is not very desirable from the perspective of llvm-project itself.
  1. Add runtime integration tests based on a bespoke test harness. This is what I ended up implementing to verify the implementation locally, and is available on my fork: https://github.com/kuhar/llvm-project/tree/arith_addi_carry/jakub. This is way too hacky to be upstreamed though. Checking all 16-bit input pairs takes around ~30s on my machine (on a single thread).

I haven't looked into mlir-cpu-runner yet, I'll see what it would take to make it work and be upstreamable. Thanks for the suggestion!

mlir-cpu-runner uses LLVM's OrcJIT to actually run the code. There are other integration tests that use it. The tests don't need to be robust in the sense that "all" cases are checked, but a few integration tests checking a mix of simple and complex cases would boost confidence that the code will be hard to break

kuhar updated this revision to Diff 460601.Sep 15 2022, 7:14 PM

Add simple (runtime) integration tests best on mlir-cpu-runner.

kuhar marked an inline comment as done.Sep 15 2022, 7:15 PM
Mogball accepted this revision.Sep 16 2022, 8:36 AM

awesome

mlir/lib/Dialect/Arithmetic/Transforms/EmulateWideInt.cpp
357

It looks like i+j is used more frequency in the loop than j. Can you make the for loop for (unsigned j = i; j != e; ++j)?

This revision is now accepted and ready to land.Sep 16 2022, 8:36 AM
kuhar marked an inline comment as done.Sep 16 2022, 9:00 AM
kuhar added inline comments.
mlir/lib/Dialect/Arithmetic/Transforms/EmulateWideInt.cpp
357

I think it's more straightforward as-is. If we defined j as i, we would have to introduce a subtraction to access rhsDigits.

kuhar updated this revision to Diff 460790.Sep 16 2022, 9:01 AM
kuhar marked an inline comment as done.

Rebased.

This revision was landed with ongoing or failed builds.Sep 16 2022, 9:04 AM
This revision was automatically updated to reflect the committed changes.