Page MenuHomePhabricator

[DAGCombiner] Combine shifts into multiply-high
ClosedPublic

Authored by amyk on Apr 15 2020, 10:03 PM.

Details

Summary

This patch implements a target independent DAG combine to produce multiply-high
instructions from shifts. This DAG combine will combine shifts for any type as
long as the MULH on the narrow type is legal.

For now, it is enabled on PowerPC as PowerPC is the only target that has an
implementation of the isMulhCheaperThanMulShift TLI hook introduced in
D78271.

Moreover, this DAG combine focuses on catching the pattern:

(shift (mul (ext <narrow_type>:$a to <wide_type>), (ext <narrow_type>:$b to <wide_type>)), <narrow_width>)

to produce mulhs when we have a sign-extend, and mulhu when we have
a zero-extend.

The patch performs the following checks:

  • Operation is a right shift arithmetic (sra) or logical (srl)
  • Input to the shift is a multiply
  • Both operands to the shift are sext/zext nodes
  • The extends into the multiply are both the same
  • The narrow type is half the width of the wide type
  • The shift amount is the width of the narrow type
  • The respective mulh operation is legal

Diff Detail

Event Timeline

amyk created this revision.Apr 15 2020, 10:03 PM
lebedev.ri added inline comments.
llvm/lib/Target/PowerPC/PPCISelLowering.cpp
15918 ↗(On Diff #257956)

This doesn't look like the best check for this.
Do we not want this transform in general, is it expected to be pessimizing somewhere?
Then i'd expect it to be a TLI hook.

amyk marked an inline comment as done.Apr 16 2020, 9:24 PM
amyk added inline comments.
llvm/lib/Target/PowerPC/PPCISelLowering.cpp
15918 ↗(On Diff #257956)

I actually realize now after doing some testing that it may be better to remove this check and to run this transformation in other passes, as well. I will update the patch to reflect this. Thanks for reviewing.

amyk updated this revision to Diff 258231.Apr 16 2020, 9:27 PM

Updated the patch so that the combine for shifts runs not only before type legalization, but in subsequent passes, too.

anil9 added a subscriber: anil9.Apr 20 2020, 9:19 PM
anil9 added inline comments.
llvm/lib/Target/PowerPC/PPCISelLowering.cpp
15849 ↗(On Diff #258231)

niit: combineShifttoMULH -> combineShiftToMULH

amyk updated this revision to Diff 259109.Apr 21 2020, 2:52 PM

Address comment to rename combineShifttoMULH to combineShiftToMULH.

amyk marked an inline comment as done.Apr 21 2020, 2:52 PM

Is there anything that would stop us making this a generic combine in DAGCombiner?

nemanjai requested changes to this revision.Apr 24 2020, 6:08 AM

The description mentions i32/i64 explicitly, but there does not seem to be anything in the combine that narrows it down to those two types. In fact, it appears that it will combine it for any type (scalar or vector) as long as the MULH on the narrow type is legal.

llvm/lib/Target/PowerPC/PPCISelLowering.cpp
15876 ↗(On Diff #259109)

You will probably need something like (void)WideVT2; to silence warnings on non-assert builds. Or you can just not define WideVT2, not have the assert and trust that the well-formedness of the multiply is adequately checked in target-independent code.

15888 ↗(On Diff #259109)

I don't understand this. Why do we assume that either ShiftAmtSrc is constant or its second operand is constant? What node do you expect it to be if it is not constant?
Also, won't this crash on (srl (mul (zext i32:%a to i64), (zext i32:%b to i64)), %c)?

i.e. something like:

unsigned test(unsigned a, unsigned b, unsigned c) {
  return (unsigned) (((uint64_t)a * b) >> c);
}
This revision now requires changes to proceed.Apr 24 2020, 6:08 AM

The description mentions i32/i64 explicitly, but there does not seem to be anything in the combine that narrows it down to those two types. In fact, it appears that it will combine it for any type (scalar or vector) as long as the MULH on the narrow type is legal.

Just to be clear, I am suggesting the description should be updated to match the combine, not the other way around.

amyk updated this revision to Diff 262264.May 5 2020, 5:10 PM
amyk edited the summary of this revision. (Show Details)

Addressed review comments, and updated the summary for this patch.

amyk added a comment.May 5 2020, 6:17 PM

Is there anything that would stop us making this a generic combine in DAGCombiner?

Is there a preference to have this target independent instead? It may be possible if there is a preference/demand for it to be.

Is there anything that would stop us making this a generic combine in DAGCombiner?

Is there a preference to have this target independent instead? It may be possible if there is a preference/demand for it to be.

@craig.topper - any thoughts? Given how expensive PMULLD/PMULLQ can be, we don't much to encourage PMULH generation.

Is there anything that would stop us making this a generic combine in DAGCombiner?

Is there a preference to have this target independent instead? It may be possible if there is a preference/demand for it to be.

@craig.topper - any thoughts? Given how expensive PMULLD/PMULLQ can be, we don't much to encourage PMULH generation.

We have a combine similar to this in X86 back it starts at the truncate not the shift.

llvm/lib/Target/PowerPC/PPCISelLowering.cpp
15882 ↗(On Diff #262264)

SIGN_EXTEND_INREG will never pass this check will it? The input and output type for that are the same. There's an extra operand carrying the type to extend from.

craig.topper added inline comments.May 6 2020, 12:15 PM
llvm/lib/Target/PowerPC/PPCISelLowering.cpp
15905 ↗(On Diff #262264)

I think what extend to use at the end needs to be base of the shift opcode not the extend opcode. If its an SRL, you need to put 0s in the upper bits even if the multiply is MULHS.

craig.topper added inline comments.May 6 2020, 9:47 PM
llvm/lib/Target/PowerPC/PPCISelLowering.cpp
15904 ↗(On Diff #262264)

I don't see a check that RightOp.getOperand(0) and LeftOp.getOperand(0) are the the same type?

amyk marked 3 inline comments as done.May 7 2020, 12:53 PM

@craig.topper Do you think common-ing out the X86/PPC parts to combine to mulh into the target independent combiner is a good idea?

llvm/lib/Target/PowerPC/PPCISelLowering.cpp
15882 ↗(On Diff #262264)

That's a good point. I thought I had tests involving SIGN_EXTEND_INREG that works with this, but I realize now that I actually don't and they're all sign extends for this patch. I've decided to move the check for SIGN_EXTEND_INREG.

15904 ↗(On Diff #262264)

That's true, thank you for pointing that out.

15905 ↗(On Diff #262264)

You're right, I'll fix that.

@craig.topper Do you think common-ing out the X86/PPC parts to combine to mulh into the target independent combiner is a good idea?

I see a few issues.

X86 doesn't have scalar MULHU/MULHS instructions, but we have vector MULHU/MULHS on vXi16. We currently match from truncate rather than from shift. I tried to move it to shift following your code here, but I got regressions. Primarily because we handled the truncate first and turned into PACKSS/PACKUS, may an AND or SIGN_EXTEND_INREG, and some subvector extracts.. Then we match the MULH, but we couldn't fold away the sext/zext and what we had turned the truncate into. We might be able to improve that. I think it is useful to match from the shift since the truncate won't always be there. So we might need matching from both.

The other issue is that for vectors we need to match for vectors width more than the legal number of elements before type legalization. Otherwise the extends we match get type legalized into something much harder to match. But checking isOperationLegal won't work for that. Maybe we could walk the type legalization steps to find what it would be legalized to?

amyk updated this revision to Diff 262813.May 7 2020, 9:57 PM

Updated the diff of this patch to address comments that were raised previously.

@craig.topper Do you think common-ing out the X86/PPC parts to combine to mulh into the target independent combiner is a good idea?

I see a few issues.

X86 doesn't have scalar MULHU/MULHS instructions, but we have vector MULHU/MULHS on vXi16. We currently match from truncate rather than from shift. I tried to move it to shift following your code here, but I got regressions. Primarily because we handled the truncate first and turned into PACKSS/PACKUS, may an AND or SIGN_EXTEND_INREG, and some subvector extracts.. Then we match the MULH, but we couldn't fold away the sext/zext and what we had turned the truncate into. We might be able to improve that. I think it is useful to match from the shift since the truncate won't always be there. So we might need matching from both.

It sounds like if we were to put this code into target-independent DAG Combine, you shouldn't see similar regressions since your combines will still run and this patch might pick up some of what couldn't be combined. Or am I misinterpreting your comment?

The other issue is that for vectors we need to match for vectors width more than the legal number of elements before type legalization. Otherwise the extends we match get type legalized into something much harder to match. But checking isOperationLegal won't work for that. Maybe we could walk the type legalization steps to find what it would be legalized to?

If I am not mistaken, this runs before legalization as well so it should combine something like
(srl (mul (zext v16i16:$a to v16i32), (zext v16i16:$a to v16i32)), (splat 16 to v16i32))
As long as MULH is legal for v16i16 (even though v16i32 is not a legal type). Of course, the actual types just for illustration - not to suggest that those are the actual types for any specific target.
Or am I again not reading your comment correctly?

@craig.topper Do you think common-ing out the X86/PPC parts to combine to mulh into the target independent combiner is a good idea?

I see a few issues.

X86 doesn't have scalar MULHU/MULHS instructions, but we have vector MULHU/MULHS on vXi16. We currently match from truncate rather than from shift. I tried to move it to shift following your code here, but I got regressions. Primarily because we handled the truncate first and turned into PACKSS/PACKUS, may an AND or SIGN_EXTEND_INREG, and some subvector extracts.. Then we match the MULH, but we couldn't fold away the sext/zext and what we had turned the truncate into. We might be able to improve that. I think it is useful to match from the shift since the truncate won't always be there. So we might need matching from both.

It sounds like if we were to put this code into target-independent DAG Combine, you shouldn't see similar regressions since your combines will still run and this patch might pick up some of what couldn't be combined. Or am I misinterpreting your comment?

Yes our combine will still run. I have no issue putting this in target-independent combine. I was answering from the position of why putting it in target combine doesn't get rid of X86 specific code. Which I assume was at least part of @RKSimon's motivation for moving it.

The other issue is that for vectors we need to match for vectors width more than the legal number of elements before type legalization. Otherwise the extends we match get type legalized into something much harder to match. But checking isOperationLegal won't work for that. Maybe we could walk the type legalization steps to find what it would be legalized to?

If I am not mistaken, this runs before legalization as well so it should combine something like
(srl (mul (zext v16i16:$a to v16i32), (zext v16i16:$a to v16i32)), (splat 16 to v16i32))
As long as MULH is legal for v16i16 (even though v16i32 is not a legal type). Of course, the actual types just for illustration - not to suggest that those are the actual types for any specific target.
Or am I again not reading your comment correctly?

But it's still probably worthwhile for illegal types like v128i16 since it turns two extends on the inputs to one on the output. Type legalization will split the v128i16 MULHU/MULHS. This needs to be matched before type legalization makes the v128i32 zero extend hard to spot. Number of elements exaggerated to make it look it likely illegal.

Yes our combine will still run. I have no issue putting this in target-independent combine. I was answering from the position of why putting it in target combine doesn't get rid of X86 specific code. Which I assume was at least part of @RKSimon's motivation for moving it.

Yes this was a query as to whether it would help on top of the existing x86 specific combines.

Since it would seem that this can't immediately be combined with the only other target that seems to want this, I would recommend that we keep this in the PPC back end for now. If there is interest in commoning it up in the future, we revisit this then.
Does that sound like a good plan?
And of course, thanks for all your feedback @craig.topper @RKSimon!

Since it would seem that this can't immediately be combined with the only other target that seems to want this, I would recommend that we keep this in the PPC back end for now. If there is interest in commoning it up in the future, we revisit this then.
Does that sound like a good plan?
And of course, thanks for all your feedback @craig.topper @RKSimon!

No objections from me - sorry for the delay!

lenary added a subscriber: lenary.May 13 2020, 3:26 AM

Since it would seem that this can't immediately be combined with the only other target that seems to want this, I would recommend that we keep this in the PPC back end for now. If there is interest in commoning it up in the future, we revisit this then.
Does that sound like a good plan?
And of course, thanks for all your feedback @craig.topper @RKSimon!

I can see this combine being useful on RISC-V (where we have a mulh[[s]u] instruction) - would it be useful for me to work on some testcases for it?

I can see this combine being useful on RISC-V (where we have a mulh[[s]u] instruction) - would it be useful for me to work on some testcases for it?

Absolutely. If we have a test case (along with invocation and desired codegen) it would make it much easier to move this out to the target independent code and ensure it is doing what it is supposed to. Thank you.

amyk updated this revision to Diff 266233.EditedMay 26 2020, 8:46 AM
amyk retitled this revision from [PowerPC] DAG Combine to transform shifts into multiply-high to [DAGCombiner] Combine shifts into multiply-high .
amyk edited the summary of this revision. (Show Details)

Moved the function to combine shifts into multiply high into DAGCombiner.cpp. For now, this combine runs only on PowerPC as PowerPC has an implementation of the isMulhCheaperThanMulShift TLI query.

@nemanjai Any more comments?

LGTM. My only remaining comments are nits about early exits from the function not happening as early as they can, but such reordering is trivial and can happen on the commit. Thank you.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
7948

I think the check for the types of the inputs to the two extend nodes belongs here. Generally favour early exits as soon as the necessary information is available.

7960

This check is on the outermost operation (i.e. the shift). We can probably move this early exit towards the very top of the function as it makes no sense to do any other checks if the shift amount isn't constant.
I don't think another round of review is required for this though - feel free to address this on the commit.

nemanjai accepted this revision.Jun 2 2020, 5:16 AM

Now if only I remember to select "Accept Revision"... :)

This revision is now accepted and ready to land.Jun 2 2020, 5:16 AM
This revision was automatically updated to reflect the committed changes.