Page MenuHomePhabricator

[X86] Reduce the width of multiplification when its operands are extended from i8 or i16
ClosedPublic

Authored by wmi on Jun 2 2016, 2:36 PM.

Details

Summary

For <N x i32> type mul, pmuludq will be used for targets without SSE41, which often introduces many extra pack and unpack instructions in vectorized loop body because pmuludq generates <N/2 x i64> type value. However when the operands of <N x i32> mul are extended from smaller size values like i8 and i16, the type of mul may be shrinked to use pmullw + pmulhw/pmulhuw instead of pmuludq, which generates better code. For targets with SSE41, pmulld is supported so no shrinking is needed.

Diff Detail

Event Timeline

wmi updated this revision to Diff 59458.Jun 2 2016, 2:36 PM
wmi retitled this revision from to [X86] Reduce the width of multiplification when its operands are extended from i8 or i16.
wmi updated this object.
wmi added reviewers: hfinkel, RKSimon, congh.
wmi set the repository for this revision to rL LLVM.
wmi added subscribers: llvm-commits, davidxl, mkuper.

The testcases are a bit more complicated than they need to be; could you reduce them to just the minimal IR? In particular, including the loops makes it harder to read.

Also, missing testcase for the multiply-by-constant case.

lib/Target/X86/X86ISelLowering.cpp
26450 ↗(On Diff #60628)

"shrunk", not "shrinked"... but you might want to use "narrowed" here instead.

26477 ↗(On Diff #60628)

What if one side is signed, and the other is unsigned?

26493 ↗(On Diff #60628)

Repeated code; should be refactored.

26507 ↗(On Diff #60628)

This is not what you want. The constant vector is "zero-extended" if ZEXT(TRUNC(vec))==vec, and "sign-extended" if SEXT(TRUNC(vec)) == vec. Computing whether the vector is a splat has no relation to either of those properties.

26612 ↗(On Diff #60628)

There's massive code duplication here; needs to be refactored. Maybe make splitting the inputs if necessary and actually computing the product separate steps?

mkuper added inline comments.Jun 2 2016, 3:13 PM
lib/Target/X86/X86ISelLowering.cpp
26452 ↗(On Diff #60628)

Any chance to split this function up? Or is there no logical way to do that?

26468 ↗(On Diff #60628)

I'm not sure DONTCARE is a good name for this - it's not a "top" value. Maybe UNKNOWN?

26475 ↗(On Diff #60628)

ISD::ANY_EXTEND would work too, right?
(Not sure how to test this, but if both ZERO and SIGN work, ANY should definitely work.)

26481 ↗(On Diff #60628)

What if it's MVT::i1?
You may want an "else return SDValue()" here as well.

26486 ↗(On Diff #60628)

Same two comments as above.

26494 ↗(On Diff #60628)

What happens if one of the extends is a sext and the other is a zext? It seems like if N0 is a sext and N1 is a zext we'll get IsSigned == true, and if it's the other way around, we'll get IsSigned == false, which seems oddly asymmetrical. Or is there a canonicalization earlier that guarantees the order of a sext and zext?

26499 ↗(On Diff #60628)

Are you guaranteed that if one of {N0, N1} is a extend, and the other is a BuildVector, N0 is the extend? I can imagine something canonizes it that way - and if that's the case, documenting this here would probably be a good idea.

26519 ↗(On Diff #60628)

Why not use N0->getOperand(0), at least when the type is i16?
Although I guess we're pretty much guaranteed DAGCombine will clean this up, and this may be a bit cleaner as is. So, I'm not really sure which is better.

26523 ↗(On Diff #60628)

This is necessary because this runs before legalization, right?
What if it ran after legalization? Does that help? Or is that too late?

26544 ↗(On Diff #60628)

Why not use a generic shuffle here? Is the lowering further down not good enough to get the right unpacks?

29799 ↗(On Diff #60628)

Why the linebreak?

test/CodeGen/X86/shrink_vmul.ll
2 ↗(On Diff #60628)

Could you add sext tests as well? I'm not sure you need the whole type x {zext, sext} matrix, but one sext test would be good.

wmi added a comment.Jun 2 2016, 5:46 PM

Eli and Michael, thanks for your comments. Will post a new patch after I address the issues mentioned.

lib/Target/X86/X86ISelLowering.cpp
26450 ↗(On Diff #60628)

Will fix it.

26452 ↗(On Diff #60628)

I can extract the pattern matching part to a separate func. Do you think it is enough?

26468 ↗(On Diff #60628)

That is better. Will fix it.

26475 ↗(On Diff #60628)

Yes, it should work. I think I can generate the same code for it with ISD::ZERO_EXTEND.

26477 ↗(On Diff #60628)

Nice catch. I think then the type of the mul should be signed.

26481 ↗(On Diff #60628)

You are right. Will fix it.

26493 ↗(On Diff #60628)

Will fix it.

26494 ↗(On Diff #60628)

Eli asked a similar question. If one side is sext, and the other is zext, then the type of mult should be signed. I will fix it.

26499 ↗(On Diff #60628)

Yes, InstCombine will canonicalize it. Will add comment for it.

26507 ↗(On Diff #60628)

Maybe I can use SplatValue.isNegative() to get the signedness of constant? Then use it together with the signedness of N0 to determine the signedness of multiplification.

26519 ↗(On Diff #60628)

Yes, I tried to make the code simpler here.

26523 ↗(On Diff #60628)

Yes, it is necessary to run before type legalization. That is because suppose original mul is of type <16 x i32>, it will be splitted into four muls of type <4 x i32> in legalization. We actually need after the transformation is two muls of type <8 x i16>, it is more difficult to merge pairs of muls if it is done after legalization.

26544 ↗(On Diff #60628)

Because I feel using generic shuffle doesn't make the code simpler. We still need two shuffles, two bitcast and one concat_vectors.

26612 ↗(On Diff #60628)

I did see there were some code duplication. Like I can merge the code for "VT.getVectorNumElements() > OpsVT.getVectorNumElements()" and that for "VT.getVectorNumElements() == OpsVT.getVectorNumElements()", but I felt it made the logic little bit less clear after the merge, which mixed the case requiring split with the case requiring no split.

Probably still worth it. I will do it and add some comments to clarify the logic.

29799 ↗(On Diff #60628)

It is done by clang-format. I think it is more consistent without the linebreak. I will fix it.

test/CodeGen/X86/shrink_vmul.ll
2 ↗(On Diff #60628)

Will add it.

mkuper added inline comments.Jun 2 2016, 5:59 PM
lib/Target/X86/X86ISelLowering.cpp
26452 ↗(On Diff #60628)

Yes, together with the refactoring Eli suggested, that should be good.

26507 ↗(On Diff #60628)

I think what Eli meant is that there's no reason to look for a splat, specifically. If every element is small enough, it doesn't matter whether the vector is a splat or not.

26519 ↗(On Diff #60628)

As long as this gets clean up, that sounds reasonable.

26523 ↗(On Diff #60628)

Ah, ok, got it, thanks.

26544 ↗(On Diff #60628)

It will actually make the code a bit more complex, really. What I had in mind was that if the result of the mul itself feeds a shuffle, leaving generic shuffles here may expose more dag combines further down the road.

wmi updated this revision to Diff 59745.Jun 6 2016, 10:23 AM

Addressed Eli and Michael's comments.

A major change is: The previous way to choose shrinking modes by only looking at sext and zext is incorrect. I should analyze and use the value ranges of mul operands to determine which shrinking mode should be chosen. Different shrinking modes and their allowed value ranges are described in the comment of reduceVMULWidth.

This is looking a lot better overall.

lib/Target/X86/X86ISelLowering.cpp
26537 ↗(On Diff #60628)

It feels like you should be able to use ComputeNumSignBits/computeKnownBits here; I'm not sure how much shorter that actually ends up, though.

26675 ↗(On Diff #60628)

It's not obvious to me why you're explicitly legalizing this here; you could just generate a MUL on, for example, <4 x i16> and legalization should do the right thing from there.

test/CodeGen/X86/shrink_vmul.ll
752 ↗(On Diff #60628)

It would probably be more clear to write this as mul nuw nsw <2 x i32> %tmp8, <i32 -32768, i32 32767>.

wmi added inline comments.Jun 7 2016, 10:44 AM
lib/Target/X86/X86ISelLowering.cpp
26537 ↗(On Diff #60628)

Thanks for the suggestion. I change the value range check of intconst to use ComputeNumSignBits. The code length doesn't change much. But ComputeNumSignBits is more powerful, I can use it to do some extension further in the future, like,

; %val1 = load <2 x i8>
; %op1 = zext<2 x i32> %val1
; %val2 = load <2 x i8>
; %op2 = zext<2 x i32> %val2
; %add = add <2 x i32> %op1, %op2
; %rst = mul <2 x i32> %add, %op2

ComputeNumSignBits may know %add's value range is within 0 ~ 32767 (Actually 0 ~ 255*2).

26675 ↗(On Diff #60628)

I choose to explicitly legalize here because implicit legalization will generate different results:

Suppose the input is <4 x i16>, for implicit legalization, it will be converted to <4 x i64> then bitcast to <8 x i16> before being used as the input of pmullw. If the input is a vector load + sext/zext, then the input needs to be unpck twice to get <4 x i64>.

For explicit legalization, I choose to concat <4 x i16> with vector undef to get <8 x i16>. If the input is a vector load + sext/zext, then the input can be directly used as the input of pmullw.

test/CodeGen/X86/shrink_vmul.ll
752 ↗(On Diff #60628)

That is better. Fixed.

wmi updated this revision to Diff 59912.Jun 7 2016, 10:45 AM
eli.friedman added inline comments.Jun 7 2016, 5:22 PM
lib/Target/X86/X86ISelLowering.cpp
26506 ↗(On Diff #60628)

It's probably more clear to express this in terms of the number of sign bits and the number of leading zeros (APInt::countLeadingZeros). Actually, you could probably get rid of the ValRange enumeration altogether in favor of those two numbers. For example, return MULS8 if std::min(signbits1, signbits2) > 24.

26675 ↗(On Diff #60628)

I'm not following... how do you get to <4 x i64>? Legalization of a <4 x i16> multiply will widen it to an <8 x i16> multiply; this codepath already gets used for IR like mul <4 x i16> %a, %b.

wmi added inline comments.Jun 7 2016, 11:26 PM
lib/Target/X86/X86ISelLowering.cpp
26506 ↗(On Diff #60628)

I rewrite it and the code is much shorter. Thanks for the suggestion.

26675 ↗(On Diff #60628)

Sorry. I wanted to say <4 x i32> instead of <4 x i64>. when legalizing <4 x i16> to <4 x i32>, it will use a punpcklwd instruction if the input is load <4 x i16>. It is different from widening <4 x i16> to <8 x i16> by filling undef in the higher bits.

wmi updated this revision to Diff 59999.Jun 7 2016, 11:27 PM
eli.friedman added inline comments.Jun 8 2016, 12:26 AM
lib/Target/X86/X86ISelLowering.cpp
26673 ↗(On Diff #60628)

Ah, I see what you mean. That's how we end up in an awful mess with the following:

typedef short a __attribute((ext_vector_type(4)));
void g(a x);
a f(a x, a y, int c) { a z = x*y; if (c) g(z+x); return z; }

That doesn't explain why you need to explicitly legalize the inputs in the case where you split the nodes, though.

wmi added inline comments.Jun 8 2016, 12:02 PM
lib/Target/X86/X86ISelLowering.cpp
26673 ↗(On Diff #60628)

I choosed to explicitly legalize because I used X86ISD::UNPCKL and X86ISD::UNPCKH instead of vector_shuffle so no mask setting was needed. Seems legalization only works for generic ISD instead of X86ISD.

Actually it doesn't need to legalize. I change unpck to vectorshuffle and remove the splitting. The code looks simpler even with the additional mask setting code.

Thanks for the suggestion.

wmi updated this revision to Diff 60077.Jun 8 2016, 12:04 PM
eli.friedman added inline comments.Jun 8 2016, 12:53 PM
lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
673 ↗(On Diff #60628)

Do you need MULHS here too? (Maybe missing a test for this?)

lib/Target/X86/X86ISelLowering.cpp
26453 ↗(On Diff #60628)

This comment is confusing; the operand isn't actually guaranteed to be between 0 and 127. (The transform is safe because we can assume an appropriate number of leading sign/zero bits.)

26478 ↗(On Diff #60628)

Probably better to use APInt::getNumSignBits here.

wmi added inline comments.Jun 8 2016, 1:33 PM
lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
673 ↗(On Diff #60628)

Ah sorry. Fixed. Add a test mul_16xi16_sext for it.

lib/Target/X86/X86ISelLowering.cpp
26453 ↗(On Diff #60628)

Fixed.

26478 ↗(On Diff #60628)

Fixed.

wmi updated this revision to Diff 60097.Jun 8 2016, 1:33 PM

LGTM, but this has been though a lot of revisions, so it's probably a good idea to someone else to double-check I didn't miss something obvious.

wmi added a subscriber: wmi.Jun 8 2016, 2:33 PM

Eli, thanks for your many helpful suggestions.

mkuper accepted this revision.Jun 13 2016, 11:51 AM
mkuper added a reviewer: mkuper.

LGTM too.
I have a couple of comments, but they're rather half-baked, feel free to ignore them if they don't make sense to you.

lib/Target/X86/X86ISelLowering.cpp
26446 ↗(On Diff #60628)

Nit - it's a bit misleading to have N->getNumOperands() here, and then index into an array of size 2.
Maybe assert N->getNumOperands() == 2 (or just use 2 here, but that's less self-documenting, I guess).

26470 ↗(On Diff #60628)

Perhaps use ComputeNumSignBits on the BUILD_VECTOR elements as well, instead of checking for const/undef?
Although I'm really not sure if that gains anything in practice, and it won't make the code significantly smaller, so it's may not be worth the overhead.

26480 ↗(On Diff #60628)

Maybe computeKnownBits instead of special-casing ZERO_EXTEND?
Although, the same as above applies, not at all sure it's worth the overhead.

This revision is now accepted and ready to land.Jun 13 2016, 11:51 AM
wmi added inline comments.Jun 13 2016, 4:42 PM
lib/Target/X86/X86ISelLowering.cpp
26446 ↗(On Diff #60628)

That is misleading indeed. Fixed and Added an assertion.

26470 ↗(On Diff #60628)

ComputeNumSignBits doesn't work for BUILD_VECTOR for now. It always return 1.

26480 ↗(On Diff #60628)

The code seems longer and costlier that way, so I choose to special-case ZERO_EXTEND here.

wmi updated this revision to Diff 60628.Jun 13 2016, 4:43 PM
wmi edited edge metadata.
This revision was automatically updated to reflect the committed changes.