pshufb can write zero bytes as well as bytes selected from a source vector. We can use this to avoid shuffling two vectors and ORing the results when all of the elements used from one of the vectors are zeroable.
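As a rough illustration of the idea (not code from the patch), the following sketch models the 128-bit pshufb in scalar C++ and checks that a single shuffle of V1 matches the two-shuffles-plus-OR sequence when every lane taken from V2 is zero. The names `emulatePshufb`, `orBytes` and `singlePshufbMatchesTwoPlusOr` are hypothetical, and the test vectors are made up for the example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec16 = std::array<std::uint8_t, 16>;

// Scalar model of 128-bit PSHUFB: a control byte with bit 7 set produces a
// zero byte; otherwise its low 4 bits select a byte from Src.
Vec16 emulatePshufb(const Vec16 &Src, const Vec16 &Ctl) {
  Vec16 Out{};
  for (int i = 0; i < 16; ++i)
    Out[i] = (Ctl[i] & 0x80) ? 0 : Src[Ctl[i] & 0x0F];
  return Out;
}

// Byte-wise OR, standing in for POR.
Vec16 orBytes(const Vec16 &A, const Vec16 &B) {
  Vec16 Out{};
  for (int i = 0; i < 16; ++i)
    Out[i] = static_cast<std::uint8_t>(A[i] | B[i]);
  return Out;
}

// Hypothetical demo: V2 is all zeros, so every lane taken from it is zeroable.
bool singlePshufbMatchesTwoPlusOr() {
  Vec16 V1{}, V2{}; // V2 stays all-zero
  for (int i = 0; i < 16; ++i)
    V1[i] = static_cast<std::uint8_t>(0x10 + i);

  // Even lanes take V1[i]; odd lanes take the (zero) element V2[i].
  Vec16 Ctl1{}, Ctl2{};
  for (int i = 0; i < 16; ++i) {
    bool FromV1 = (i % 2 == 0);
    Ctl1[i] = static_cast<std::uint8_t>(FromV1 ? i : 0x80);
    Ctl2[i] = static_cast<std::uint8_t>(FromV1 ? 0x80 : i);
  }

  Vec16 TwoShuffles = orBytes(emulatePshufb(V1, Ctl1), emulatePshufb(V2, Ctl2));
  Vec16 OneShuffle = emulatePshufb(V1, Ctl1); // zero lanes come from bit 7
  return TwoShuffles == OneShuffle;
}
```

Because the zeroed lanes are produced by the control byte itself, the second shuffle and the OR contribute nothing in this case.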
Repository: rL LLVM
Hi Simon,
This looks mostly good to me. I have just one concern: I find the current structure error-prone in case we need to update it.
See my inline comments.
Thanks,
-Quentin
lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
9602 ↗ | (On Diff #17891) | This is just a suggestion: `int V1Idx = ((Mask[i] < 16) ? Mask[i] : 0x80);` `int V2Idx = ((Mask[i] < 16) ? 0x80 : Mask[i] - 16);` `if (Zeroable[i]) V1Idx = V2Idx = 0x80;` |
9605 ↗ | (On Diff #17891) | I would introduce a constant for 0x80 instead of having it spread. |
9606 ↗ | (On Diff #17891) | We already know this from the ‘?:’ statements. |
9611 ↗ | (On Diff #17891) | I would structure this `if` and the following one a bit differently. Currently we have: `// do1 if (!B) return A } if (B) { // do2 if (!A) return B return // do3 }`. I would do instead: `// do1 if (B) // do2 if (A && B) return // do3 return A ? A : B;` |
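To make the inline suggestions concrete, here is a minimal standalone sketch of how they might fit together. It is not the code from the diff: `ZeroLane`, `PshufbMasks`, `buildPshufbMasks`, `Lowering` and `chooseLowering` are hypothetical names, and the sketch assumes every mask entry is in the range [0, 32) (no undef entries), with entries 16..31 selecting from V2.

```cpp
#include <array>
#include <cassert>

// Hypothetical named constant for the "write a zero byte" control value,
// as suggested in the comment on line 9605.
constexpr int ZeroLane = 0x80;

struct PshufbMasks {
  std::array<int, 16> V1Mask;
  std::array<int, 16> V2Mask;
  bool V1InUse = false;
  bool V2InUse = false;
};

// Builds one control mask per input. Assumes Mask[i] is in [0, 32).
PshufbMasks buildPshufbMasks(const std::array<int, 16> &Mask,
                             const std::array<bool, 16> &Zeroable) {
  PshufbMasks R;
  for (int i = 0; i < 16; ++i) {
    assert(Mask[i] >= 0 && Mask[i] < 32 && "sketch does not handle undef");
    int V1Idx = (Mask[i] < 16) ? Mask[i] : ZeroLane;
    int V2Idx = (Mask[i] < 16) ? ZeroLane : Mask[i] - 16;
    if (Zeroable[i])
      V1Idx = V2Idx = ZeroLane; // either input can supply the zero
    R.V1Mask[i] = V1Idx;
    R.V2Mask[i] = V2Idx;
    R.V1InUse |= (V1Idx != ZeroLane);
    R.V2InUse |= (V2Idx != ZeroLane);
  }
  return R;
}

enum class Lowering { ShuffleV1, ShuffleV2, ShuffleBothAndOr };

// Mirrors the "if (A && B) return do3; return A ? A : B;" shape.
// If neither input is in use, ShuffleV2 with an all-ZeroLane mask
// still yields the required all-zero vector.
Lowering chooseLowering(const PshufbMasks &M) {
  if (M.V1InUse && M.V2InUse)
    return Lowering::ShuffleBothAndOr;
  return M.V1InUse ? Lowering::ShuffleV1 : Lowering::ShuffleV2;
}
```

With this shape the single-input and two-input cases share one exit path, so a later change to either shuffle does not need to be mirrored in a second early return.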
Thanks Quentin.
A basic timing test of a core loop using a single pshufb versus 2 x pshufb + por gave a 30% improvement on my older Core 2 Duo machine (I guess due to throughput limitations), but this diminished to less than 5% on Sandy Bridge. However, the main benefit is the reduction in register pressure, along with no longer pointlessly shuffling zero vectors.