This is an archive of the discontinued LLVM Phabricator instance.

[PowerPC] Add ISEL patterns for Mul with Imm.
ClosedPublic

Authored by Esme on Sep 9 2020, 8:12 AM.

Details

Summary

If the multiplicand is a constant with following formats:

  1. mul with (2^N * int16_imm) -> mulli + rldicr
  2. mul with (2^N + 2^M) -> rldicr + add + rldicr
  3. mul with (2^N - 2^M) -> rldicr + sub + rldicr

Scenario 2 and 3 are moved to DAGCombiner D88201.

Diff Detail

Event Timeline

Esme created this revision.Sep 9 2020, 8:12 AM
Esme requested review of this revision.Sep 9 2020, 8:12 AM
Esme edited the summary of this revision. (Show Details)Sep 9 2020, 7:45 PM
steven.zhang accepted this revision.Sep 16 2020, 1:15 AM

LGTM. Thank you for doing this. I assume that you have run it with bmk to make sure that the functionality is fine.

This revision is now accepted and ready to land.Sep 16 2020, 1:15 AM
jsji added a comment.Sep 17 2020, 11:27 AM

Why this can NOT be done in DAGCombiner by implementing decomposeMulByConstant target hook?

llvm/lib/Target/PowerPC/PPCISelDAGToDAG.cpp
5039

Can we add comments about all these scenarios? With simple examples.

5049

exactly two bits set ? What do you mean?

Why this can NOT be done in DAGCombiner by implementing decomposeMulByConstant target hook?

Do you have evidence of profitability of (mul (shl %a, N) M) being preferable over (mul %a, C) on other targets? If so, I suppose extending the DAG combine that handles:

mul x, (2^N + 1) --> add (shl x, N), x
mul x, (2^N - 1) --> sub (shl x, N), x

Perhaps it wold be good to gauge interest from other RISC targets.

Scenarios 2 and 3 are quite similar to the existing DAG combine so this would be a straightforward implementation. However I am not sure how likely it is that two shifts and an add/sub are better than a multiply on other targets.

llvm/lib/Target/PowerPC/PPCISelDAGToDAG.cpp
5039

I agree 100%. We should illustrate this with some example bit patterns.

5049

2^N + 2^M has exactly two bits set.

However, I don't think this is profitable as implemented. This implementation does not benefit from an ILP improvement since the instructions are dependent. What we probably want is to convert
(mul %a, (2^N + 2^M)) to (add (shl %a, N), (shl %a, M)) since that improves ILP so the total latency of the sequence is actually reduced. That would also mean that for something like this:

*res1 = a * 0x8800;
*res2 = a * 0x8080;

we only need three total shifts (we should also add that as a test case).
Similar argument applies to the subtract case below.

Esme added a comment.Sep 21 2020, 8:32 PM

Thank you for your comments! @jsji @nemanjai

For scenarios 2 and 3, I will modified it as @nemanjai 's hint, and move it to DAGCombiner since it is really similar to existing patterns and then implement our PPCTargetLowering::decomposeMulByConstant. If other targets are interested about this, they can add conditions for this in their decomposeMulByConstant.

But for scenario 1, I am not sure if there are benefits to other targets, since it is inspired by our HW instruction MULLI. So I prefer to keep it in ISEL.

Esme updated this revision to Diff 293409.Sep 22 2020, 4:01 AM
Esme edited the summary of this revision. (Show Details)

Keep the scenario 1 implemented in ISEL.
For scenario 2 and 3, I will post another patch to implement them in DAGCombiner.

jsji added a comment.EditedSep 22 2020, 11:45 AM

I think we already have similar pattern for scenario 1 as well:

// Change (mul (shl X, C), Y) -> (shl (mul X, Y), C) when the shift has one
// use.

Just that (shl X,C) is not constant there..

So I would assume dealing with similar situation controlled by TLI.decomposeMulByConstantwill be easy and also no harm to other targets.
Targets can control and add this scenario in their TLI if they also want this.

BTW: looks like x86 also has imul.

llvm/lib/Target/PowerPC/PPCISelDAGToDAG.cpp
5038

Maybe it would be clearer if we use DAG expression in comments.
eg: (mul X, c2 << c1) -> (rldicr (mulli X, c2 >> c1) c1)

Esme added a comment.EditedSep 23 2020, 12:55 AM

Hi @jsji An infinite loop will occur if we handle the scenario 1 (mul X, c2 << c1) -> (shl (mul X, c2), c1) in DAGCombiner, because there exists a reverse conversion (shl (mul x, c1), c2) -> (mul x, c1 << c2).

jsji added a comment.Sep 23 2020, 6:06 AM

Hi @jsji An infinite loop will occur if we handle the scenario 1 (mul X, c2 << c1) -> (mul (shl X, c1), c2) in DAGCombiner, because there exists a reverse conversion (mul (shl X, c1), c2) -> (mul X, c2 << c1).

Hmm.. then OK to keep this in ISEL, but please add comments about DAGcombiner prefer (mul X, c2 << c1). Thanks

Esme updated this revision to Diff 293950.Sep 24 2020, 12:29 AM

Update comments.

Esme edited the summary of this revision. (Show Details)Sep 24 2020, 12:29 AM
Esme updated this revision to Diff 293968.Sep 24 2020, 1:32 AM
Esme edited the summary of this revision. (Show Details)
Esme added a comment.Sep 24 2020, 1:36 AM

Thank you so much for your comments. :D @jsji @nemanjai
I have post another patch D88201 to handle scenario 2 and 3 in DAGCombiner.

Esme updated this revision to Diff 293996.Sep 24 2020, 3:08 AM

Format.

Esme added a comment.Oct 1 2020, 9:30 AM

Looking forward to your further comments :) @jsji @nemanjai

jsji accepted this revision.Nov 6 2020, 7:52 AM

LGTM. Thanks for improving this!

This revision was automatically updated to reflect the committed changes.