This is an archive of the discontinued LLVM Phabricator instance.

AggressiveInstCombine: Fold full mul i64 x i64 -> i128
AbandonedPublic

Authored by chfast on Jan 2 2019, 12:15 PM.

Details

Summary

This PR tries to match full multiplication pattern i64 x i64 -> i128 done by 4 i32 x i32 -> i64 multiplication and meshing the results of those.

This pattern has two outputs: high & low parts and it makes the matching a bit difficult especially when you consider this is my first pattern matcher.

Currently high and low parts are mapped independently what result in generation of two multiplications. I have 3 ideas how to fix this, but suggestions welcome:

  1. Find another pass capable of merging the same multiplications. I tried InstCombine, but instead of merging 2 identical i128 multiplications it rather truncates on of them.
  1. Separate pattern matching from instruction rewrite. Firstly find all patterns and remember them in a worklist. Later try to map patters for low and high by their arguments.
  1. When on of the patterns is found, try to find the pattern for the other part by traversing basic block further.

Diff Detail

Event Timeline

chfast created this revision.Jan 2 2019, 12:15 PM

Haven't taken a deep look yet, but some preliminary thoughts.
Also, i don't think this should be hardcoded to some particular bitwidth.

lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
255–265

I don't see why these need to be actual functions, lambdas will do?

312–316
if (match(&I, m_c_Or(m_LowPart(m_Value(t0)),
                     m_Shl(m_c_Add(m_LowPart(m_c_Add(m_HighPart(m_Deferred(t0)),
                                                     m_Value(t1))),
                                   m_Value(t2)),
                           m_SpecificInt(32))))) {
331

and now you only have t0, no t0a

test/Transforms/AggressiveInstCombine/mul128.ll
1 ↗(On Diff #179921)

Please use llvm/utils/update_test_checks.py.
And move the initial test case into another review, so this diff shows the *change* in the test output.

chfast edited the summary of this revision. (Show Details)Jan 2 2019, 12:24 PM
chfast marked 2 inline comments as done.Jan 2 2019, 12:33 PM
chfast added inline comments.
lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
255–265

Yes, that's true. I was struggling with the matchers at first so I did them by copy&paste. Will fix that, unless this pattern is useful to someone else.

312–316

That's cool. m_Deferred definitely lacks some documentation.

chfast added a comment.Jan 2 2019, 2:28 PM

Also, i don't think this should be hardcoded to some particular bitwidth.

Yes, I agree. However, I don't know how to check for bitwidth in match().

Quuxplusone added inline comments.
lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
267

Does it also match the other two I mentioned in the cfe-dev thread? Specifically, where you have (my version TWO):

%u1ls = shl i64 %u1, 32
%lo = or i64, %u1ls, %t0l

my version ONE has:

%lo = mul i64 %x, %y

and my version THREE has:

%u3 = add i64 %t2, %t1
%u3ls = shl i64 %u3, 32
%lo = add i64 %u3ls, %t0

https://godbolt.org/z/_1pDoz

361

Remove debugging printf?

test/Transforms/AggressiveInstCombine/mul128.ll
8 ↗(On Diff #179921)

Peanut gallery says: I doubt that this test captures everything that you want to test about the optimization. You just check that the output contains mul nuw i128, but what if it contains that instruction plus a bunch more unintended stuff?

But I don't know anything about how LLVM optimizations are usually tested. Maybe this test is fine as-is.

craig.topper added inline comments.
lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
297

Don't you need to check the type is i64 somewhere? Or did I miss it?

chfast marked 3 inline comments as done.Jan 2 2019, 3:55 PM
chfast added inline comments.
lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
267

As mentioned before, currently the optimization matches patterns for low and high independently. Mostly, because I don't know yet what is the best way to combine both.

Currently for low it replaces pattern TWO with ONE. The THREE will not work.

These are great test, will add them to the test suite in a separate review as suggested.

297

Yes, I should. Is there a way to do this check with match(). I have not found any example doing this.

test/Transforms/AggressiveInstCombine/mul128.ll
8 ↗(On Diff #179921)

@lebedev.ri already suggested how to generate better checks.

This also does nothing to guarantee that all(or most) of the instructions will be removed. They could have additional users.

If we're in 32-bit mode then the 128-bit result producing X86 instruction doesn't exist. So this will get expanded to a bunch of smaller multiplies and adds. Do we produce something as good as or better than what we would get if we left the user code alone?

craig.topper added inline comments.Jan 2 2019, 4:03 PM
lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
297

No. But you can check I.getType()->isIntegerTy(64);

chfast marked 2 inline comments as done.Jan 3 2019, 9:50 AM
chfast added inline comments.
test/Transforms/AggressiveInstCombine/mul128.ll
1 ↗(On Diff #179921)
lebedev.ri added inline comments.Jan 3 2019, 10:27 AM
test/Transforms/AggressiveInstCombine/mul128.ll
1 ↗(On Diff #179921)

(Yep, now this diff just needs to be based ontop of that diff, so the tests show difference)

chfast marked 2 inline comments as done.Jan 3 2019, 10:30 AM
chfast added inline comments.
test/Transforms/AggressiveInstCombine/mul128.ll
1 ↗(On Diff #179921)

Can this be done in Phabricator?

lebedev.ri added inline comments.Jan 3 2019, 10:37 AM
test/Transforms/AggressiveInstCombine/mul128.ll
1 ↗(On Diff #179921)

Phabricator only displays the diff you upload.
If you used git for this, simply keep these two diffs as two consecutive commits,
and upload each one of them separately to their respective reviews.
If svn, no idea.

chfast updated this revision to Diff 180132.Jan 3 2019, 1:32 PM
chfast edited the summary of this revision. (Show Details)

Update with some intermidiate changes.

chfast marked an inline comment as done.Jan 3 2019, 1:46 PM

I did a small update.

I rebased the diff on top of the review with tests.

I focused on merging replacement for low and high parts. The strategy is to instead of blindly replacing the pattern with the single multiplication to first try to find the desired multiplication instruction. This feels quite "manual". And I also have trouble with properly placing the new mul instruction.

The types are not checked yet.
I plan to check the native integer size from DataLayout. So this transform will be applied for i64xi64->i128 when i64 is native.

Answering the question: the CodeGen generates the same pattern (I was fixing some bugs there years ago, will verify that claim later on). I don't see benefit of applying this transform if it is going to be reverted in CodeGen unless you know any optimization what this might enable.

lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
255–265

These must be templates, so lambada will not work until C++14.

chfast updated this revision to Diff 185287.Feb 5 2019, 5:24 AM

Hi again,

I believe I addressed most of the comments. Now HI and LO parts are matched independently, but when both are matched they will use the same i128 multiplication.
Now also DataLayout is checked for the max int size. The pass only replaces multiplication up to 2x native int size. E.g. it will produce max i64 multiplication on 32-bit targets.

The only think left to do is to address the comment about other uses of intermediate values.
The most restrictive approach would be to check all intermediate values if the number of uses matches the pattern?

Herald added a project: Restricted Project. · View Herald TranscriptFeb 5 2019, 5:24 AM
chfast updated this revision to Diff 185297.Feb 5 2019, 6:28 AM
chfast marked 3 inline comments as done.

Update unit tests.

chfast marked 4 inline comments as done.Feb 5 2019, 6:30 AM
Quuxplusone added inline comments.Feb 5 2019, 9:09 AM
lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
255

This also does nothing to guarantee that all(or most) of the instructions will be removed.[...] Do we produce something as good as or better than what we would get if we left the user code alone?

The only think left to do is to address the comment about other uses of intermediate values. The most restrictive approach would be to check all intermediate values if the number of uses matches the pattern?

IIUC, this is not a problem with _correctness_, right? We are protected against removing an instruction whose output still has live uses? But we're worried that the intermediate outputs will all have so many uses that we'll end up generating our MUL *and* keeping all those intermediate instructions, and so the codegen will be bigger than if we'd left it alone.

If I've understood the problem correctly, then I think @chfast's proposed solution is correct: you should do this optimization only if every intermediate result is completely dead (or can be replaced by a corresponding intermediate result of the new code). The vast majority of cases where we want this optimization to fire will be cases where all the intermediate results are dead.

280

Here and lines 277 and 290: const auto should be something else (such as [simply auto&&](https://quuxplusone.github.io/blog/2018/12/15/autorefref-always-works/)), or else the const applies only to the copy you made of whatever the element type of U->users() was. I suspect you actually meant const auto * but I'm not sure.

chfast updated this revision to Diff 185816.Feb 7 2019, 10:43 AM

Another round of changes.

I fixed some small defects and added more tests.

I'm also checking the number of uses of different intermediate values. However, this check is not perfect. This is the best I could get in the current design of the pass. The main problem is that I try to match 2 different patterns: mullo (actually has 2 variants) and umulhi. Depending if the go together, the uses count differs. Let me know what you think about the current code.

chfast marked an inline comment as done.Feb 7 2019, 10:45 AM
Quuxplusone added inline comments.Feb 8 2019, 7:29 AM
lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp
392

Nit: This expression isn't intuitively clear to me. Also, I would write uint64_t{0} as uint64_t(0).

I think you're getting really lucky here that MaxSizeInBits / 2 just happens to be the same number of bits (64) as the width of uint64_t{0}; otherwise this math would be wrong.

How about

assert(HalfSizeInBits <= 64);
const auto LowMask = m_SpecificInt((uint64_t(1) << (HalfSizeInBits-1) << 1) - 1);

or if there's an existing utility function to compute uint64_t(1) << HalfSizeInBits directly.

RKSimon added a subscriber: RKSimon.

@chfast What happened to this patch? I was looking at https://llvm.org/PR36243 and wondering if it'd be worth AggressiveInstCombine/InstCombine doing something similar for adds (in that case a 3 chain add i32 to add i96).

Herald added a project: Restricted Project. · View Herald TranscriptMar 13 2022, 8:31 AM

@chfast What happened to this patch? I was looking at https://llvm.org/PR36243 and wondering if it'd be worth AggressiveInstCombine/InstCombine doing something similar for adds (in that case a 3 chain add i32 to add i96).

I don't remember correctly, but I think I have not received clear answer this is worth the complexity.

On the technical level, we should pick a cut-off point. So it may be ok to do the transformation for i128 given i64 is native type (based on data layout?).

So in the addc case I don't think it make sense to match i96 given the biggest native type is i32. Otherwise, you will be matching a lot of integer multi-precision code and move the work to legalization.

In my practice, LLVM handles multi-precision workloads without builtins pretty good as of recently. However, I'm missing generic addc/subc intrinsic (__builtin_addc is implemented by two uaddos).

@chfast What do you want to do with this patch now that D136015 landed?

chfast added a comment.Nov 2 2022, 1:40 AM

@chfast What do you want to do with this patch now that D136015 landed?

I don't really need it any more (I was overoptimistic this is a portable way to get access to 64x64→128 mul instruction). But if you think the change is good and has now flaws I can finish it in some free time. Should it be moved to InstCombine then?

Probably abandon it for now? You can always resurrect it if you find a compelling case.

RKSimon resigned from this revision.Dec 5 2022, 7:07 AM
lebedev.ri resigned from this revision.Jan 12 2023, 5:32 PM

This review may be stuck/dead, consider abandoning if no longer relevant.
Removing myself as reviewer in attempt to clean dashboard.

chfast abandoned this revision.Jan 13 2023, 12:59 AM