For <N x i32> type mul, pmuludq will be used for targets without SSE41, which often introduces many extra pack and unpack instructions in vectorized loop body because pmuludq generates <N/2 x i64> type value. However when the operands of <N x i32> mul are extended from smaller size values like i8 and i16, the type of mul may be shrinked to use pmullw + pmulhw/pmulhuw instead of pmuludq, which generates better code. For targets with SSE41, pmulld is supported so no shrinking is needed.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
The testcases are a bit more complicated than they need to be; could you reduce them to just the minimal IR? In particular, including the loops makes it harder to read.
Also, missing testcase for the multiply-by-constant case.
lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
26450 | "shrunk", not "shrinked"... but you might want to use "narrowed" here instead. | |
26477 | What if one side is signed, and the other is unsigned? | |
26493 | Repeated code; should be refactored. | |
26507 | This is not what you want. The constant vector is "zero-extended" if ZEXT(TRUNC(vec))==vec, and "sign-extended" if SEXT(TRUNC(vec)) == vec. Computing whether the vector is a splat has no relation to either of those properties. | |
26612 | There's massive code duplication here; needs to be refactored. Maybe make splitting the inputs if necessary and actually computing the product separate steps? |
lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
26452 | Any chance to split this function up? Or is there no logical way to do that? | |
26468 | I'm not sure DONTCARE is a good name for this - it's not a "top" value. Maybe UNKNOWN? | |
26475 | ISD::ANY_EXTEND would work too, right? | |
26481 | What if it's MVT::i1? | |
26486 | Same two comments as above. | |
26494 | What happens if one of the extends is a sext and the other is a zext? It seems like if N0 is a sext and N1 is a zext we'll get IsSigned == true, and if it's the other way around, we'll get IsSigned == false, which seems oddly asymmetrical. Or is there a canonicalization earlier that guarantees the order of a sext and zext? | |
26499 | Are you guaranteed that if one of {N0, N1} is a extend, and the other is a BuildVector, N0 is the extend? I can imagine something canonizes it that way - and if that's the case, documenting this here would probably be a good idea. | |
26519 | Why not use N0->getOperand(0), at least when the type is i16? | |
26523 | This is necessary because this runs before legalization, right? | |
26544 | Why not use a generic shuffle here? Is the lowering further down not good enough to get the right unpacks? | |
29799 | Why the linebreak? | |
test/CodeGen/X86/shrink_vmul.ll | ||
2 | Could you add sext tests as well? I'm not sure you need the whole type x {zext, sext} matrix, but one sext test would be good. |
Eli and Michael, thanks for your comments. Will post a new patch after I address the issues mentioned.
lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
26450 | Will fix it. | |
26452 | I can extract the pattern matching part to a separate func. Do you think it is enough? | |
26468 | That is better. Will fix it. | |
26475 | Yes, it should work. I think I can generate the same code for it with ISD::ZERO_EXTEND. | |
26477 | Nice catch. I think then the type of the mul should be signed. | |
26481 | You are right. Will fix it. | |
26493 | Will fix it. | |
26494 | Eli asked a similar question. If one side is sext, and the other is zext, then the type of mult should be signed. I will fix it. | |
26499 | Yes, InstCombine will canonicalize it. Will add comment for it. | |
26507 | Maybe I can use SplatValue.isNegative() to get the signedness of constant? Then use it together with the signedness of N0 to determine the signedness of multiplification. | |
26519 | Yes, I tried to make the code simpler here. | |
26523 | Yes, it is necessary to run before type legalization. That is because suppose original mul is of type <16 x i32>, it will be splitted into four muls of type <4 x i32> in legalization. We actually need after the transformation is two muls of type <8 x i16>, it is more difficult to merge pairs of muls if it is done after legalization. | |
26544 | Because I feel using generic shuffle doesn't make the code simpler. We still need two shuffles, two bitcast and one concat_vectors. | |
26612 | I did see there were some code duplication. Like I can merge the code for "VT.getVectorNumElements() > OpsVT.getVectorNumElements()" and that for "VT.getVectorNumElements() == OpsVT.getVectorNumElements()", but I felt it made the logic little bit less clear after the merge, which mixed the case requiring split with the case requiring no split. Probably still worth it. I will do it and add some comments to clarify the logic. | |
29799 | It is done by clang-format. I think it is more consistent without the linebreak. I will fix it. | |
test/CodeGen/X86/shrink_vmul.ll | ||
2 | Will add it. |
lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
26452 | Yes, together with the refactoring Eli suggested, that should be good. | |
26507 | I think what Eli meant is that there's no reason to look for a splat, specifically. If every element is small enough, it doesn't matter whether the vector is a splat or not. | |
26519 | As long as this gets clean up, that sounds reasonable. | |
26523 | Ah, ok, got it, thanks. | |
26544 | It will actually make the code a bit more complex, really. What I had in mind was that if the result of the mul itself feeds a shuffle, leaving generic shuffles here may expose more dag combines further down the road. |
Addressed Eli and Michael's comments.
A major change is: The previous way to choose shrinking modes by only looking at sext and zext is incorrect. I should analyze and use the value ranges of mul operands to determine which shrinking mode should be chosen. Different shrinking modes and their allowed value ranges are described in the comment of reduceVMULWidth.
This is looking a lot better overall.
lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
26537 | It feels like you should be able to use ComputeNumSignBits/computeKnownBits here; I'm not sure how much shorter that actually ends up, though. | |
26675 | It's not obvious to me why you're explicitly legalizing this here; you could just generate a MUL on, for example, <4 x i16> and legalization should do the right thing from there. | |
test/CodeGen/X86/shrink_vmul.ll | ||
752 | It would probably be more clear to write this as mul nuw nsw <2 x i32> %tmp8, <i32 -32768, i32 32767>. |
lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
26537 | Thanks for the suggestion. I change the value range check of intconst to use ComputeNumSignBits. The code length doesn't change much. But ComputeNumSignBits is more powerful, I can use it to do some extension further in the future, like, ; %val1 = load <2 x i8> ComputeNumSignBits may know %add's value range is within 0 ~ 32767 (Actually 0 ~ 255*2). | |
26675 | I choose to explicitly legalize here because implicit legalization will generate different results: Suppose the input is <4 x i16>, for implicit legalization, it will be converted to <4 x i64> then bitcast to <8 x i16> before being used as the input of pmullw. If the input is a vector load + sext/zext, then the input needs to be unpck twice to get <4 x i64>. For explicit legalization, I choose to concat <4 x i16> with vector undef to get <8 x i16>. If the input is a vector load + sext/zext, then the input can be directly used as the input of pmullw. | |
test/CodeGen/X86/shrink_vmul.ll | ||
752 | That is better. Fixed. |
lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
26506 | It's probably more clear to express this in terms of the number of sign bits and the number of leading zeros (APInt::countLeadingZeros). Actually, you could probably get rid of the ValRange enumeration altogether in favor of those two numbers. For example, return MULS8 if std::min(signbits1, signbits2) > 24. | |
26675 | I'm not following... how do you get to <4 x i64>? Legalization of a <4 x i16> multiply will widen it to an <8 x i16> multiply; this codepath already gets used for IR like mul <4 x i16> %a, %b. |
lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
26506 | I rewrite it and the code is much shorter. Thanks for the suggestion. | |
26675 | Sorry. I wanted to say <4 x i32> instead of <4 x i64>. when legalizing <4 x i16> to <4 x i32>, it will use a punpcklwd instruction if the input is load <4 x i16>. It is different from widening <4 x i16> to <8 x i16> by filling undef in the higher bits. |
lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
26673 | Ah, I see what you mean. That's how we end up in an awful mess with the following: typedef short a __attribute((ext_vector_type(4))); void g(a x); a f(a x, a y, int c) { a z = x*y; if (c) g(z+x); return z; } That doesn't explain why you need to explicitly legalize the inputs in the case where you split the nodes, though. |
lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
26673 | I choosed to explicitly legalize because I used X86ISD::UNPCKL and X86ISD::UNPCKH instead of vector_shuffle so no mask setting was needed. Seems legalization only works for generic ISD instead of X86ISD. Actually it doesn't need to legalize. I change unpck to vectorshuffle and remove the splitting. The code looks simpler even with the additional mask setting code. Thanks for the suggestion. |
lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp | ||
---|---|---|
673 | Do you need MULHS here too? (Maybe missing a test for this?) | |
lib/Target/X86/X86ISelLowering.cpp | ||
26453 | This comment is confusing; the operand isn't actually guaranteed to be between 0 and 127. (The transform is safe because we can assume an appropriate number of leading sign/zero bits.) | |
26478 | Probably better to use APInt::getNumSignBits here. |
LGTM, but this has been though a lot of revisions, so it's probably a good idea to someone else to double-check I didn't miss something obvious.
LGTM too.
I have a couple of comments, but they're rather half-baked, feel free to ignore them if they don't make sense to you.
lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
26446 | Nit - it's a bit misleading to have N->getNumOperands() here, and then index into an array of size 2. | |
26470 | Perhaps use ComputeNumSignBits on the BUILD_VECTOR elements as well, instead of checking for const/undef? | |
26480 | Maybe computeKnownBits instead of special-casing ZERO_EXTEND? |
Do you need MULHS here too? (Maybe missing a test for this?)