Another opportunity to reduce masked stores: in D16691, we decided not to attempt the 'one mask element is set' transform in InstCombine, but I think this should be a win for any AVX machine.
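As a rough sketch of what the transform means at the IR level (a hedged illustration using current intrinsic naming and made-up function/value names, not the actual tests from this patch): when exactly one mask element is known to be true, the masked store can be replaced by an extractelement plus a plain scalar store to the matching offset.

```llvm
; Before: masked store of a v4f32 where only mask element 2 is set.
define void @one_mask_bit_set(ptr %addr, <4 x float> %val) {
  call void @llvm.masked.store.v4f32.p0(<4 x float> %val, ptr %addr, i32 16,
                                        <4 x i1> <i1 false, i1 false, i1 true, i1 false>)
  ret void
}

; After (conceptually): extract the one live element and store it to the
; corresponding address; the other lanes were never written either way.
define void @one_mask_bit_set_scalarized(ptr %addr, <4 x float> %val) {
  %elt = extractelement <4 x float> %val, i64 2
  %gep = getelementptr inbounds float, ptr %addr, i64 2
  store float %elt, ptr %gep, align 4
  ret void
}

declare void @llvm.masked.store.v4f32.p0(<4 x float>, ptr, i32 immarg, <4 x i1>)
```

The plain scalar store avoids the masked-store instruction entirely, which is why the summary expects this to be a win on any AVX machine.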
Diff Detail
- Repository: rL LLVM
Event Timeline
Please can you add tests for i64 or f64 vectors as well? Also add cases where the element is not from the lower 128 bits of a ymm/zmm.
Some minor comments. I wonder if this could be moved to DAGCombiner - possibly with a test for scalar store legality and TLI.isExtractVectorElementCheap()?
| lib/Target/X86/X86ISelLowering.cpp | | |
|---|---|---|
| 26727 ↗ | (On Diff #46713) | This can probably be brought inside reduceMaskedStoreToScalarStore as a helper predicate. |
| 26730 ↗ | (On Diff #46713) | Don't we have cases where the build-vector input operands are implicitly truncated to i1? For instance, vector constant folding is likely to have created legal types that are larger. |
I know that on AVX, VMASKMOV does not imply the non-temporal hint. However, what about targets that implement masked stores as non-temporal stores of selected bytes? For those targets, not preserving the non-temporal hint when converting the masked store into a scalar store would affect the hardware's write-combining logic.
That said, my knowledge is limited to x86, so I cannot say whether such targets exist. What I am trying to say is that we probably have to be careful if we decide to move this transformation into the DAGCombiner, since different targets may expand these nodes in slightly different ways.
Basically, I recommend leaving it for now as a target-specific combine (just my opinion).
I have preserved the non-temporal hint of the original masked store in the scalar store with this patch, but...
> That said, my knowledge is limited to x86, so I cannot say whether such targets exist. What I am trying to say is that we probably have to be careful if we decide to move this transformation into the DAGCombiner, since different targets may expand these nodes in slightly different ways.
> Basically, I recommend leaving it for now as a target-specific combine (just my opinion).
That was my thinking too. Let's make sure this is sane across x86 first. There's a lot of potential variation in how a masked op is implemented. I'll add a 'TODO' comment to remind us that this could be lifted to DAGCombiner once we have more confidence.
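To make the non-temporal concern concrete, here is a hedged IR-level sketch (function and value names are illustrative, and it assumes the !nontemporal metadata on the intrinsic call is what carries the hint through lowering): the point is simply that the replacement scalar store should keep whatever non-temporal marker the original masked store had.

```llvm
; Hypothetical non-temporal masked store with only mask element 0 set...
define void @one_mask_bit_set_nt(ptr %addr, <4 x float> %val) {
  call void @llvm.masked.store.v4f32.p0(<4 x float> %val, ptr %addr, i32 16,
            <4 x i1> <i1 true, i1 false, i1 false, i1 false>), !nontemporal !0
  ret void
}

; ...and the scalarized form, with the hint carried over to the new store.
define void @one_mask_bit_set_nt_scalarized(ptr %addr, <4 x float> %val) {
  %elt = extractelement <4 x float> %val, i64 0
  store float %elt, ptr %addr, align 4, !nontemporal !0
  ret void
}

declare void @llvm.masked.store.v4f32.p0(<4 x float>, ptr, i32 immarg, <4 x i1>)

!0 = !{i32 1}
```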
| lib/Target/X86/X86ISelLowering.cpp | | |
|---|---|---|
| 26727 ↗ | (On Diff #46713) | Yes - good idea. My lambda skills are non-existent. |
| 26730 ↗ | (On Diff #46713) | I'm not sure about the interaction between legalization and the IR intrinsic. I was expecting that we always get here with the IR-defined form. How about I add another 'TODO' for now? I think supporting different mask types may require careful logic because ISD::MSTORE doesn't actually define the mask format AFAICT. |
Patch updated:
- Converted getOneTrueElt() into a lambda.
- Added TODO comment about hoisting this to DAGCombiner.
- Added TODO comment about supporting other mask types (but if we do, what is the format for those masks?)
- Added test cases to demonstrate extraction from different types, high elements, and different sizes of vectors.
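As a hedged sketch of the kind of test case being described (types, names, and constants here are illustrative, not copied from the patch): a v4f64 masked store spans a full ymm register on AVX, so setting only the last mask bit exercises an element that lives in the upper 128 bits.

```llvm
; v4f64 with only mask element 3 set: the live element is in the high
; 128 bits of the ymm and should become a single scalar f64 store.
define void @one_mask_bit_set_high_f64(ptr %addr, <4 x double> %val) {
  call void @llvm.masked.store.v4f64.p0(<4 x double> %val, ptr %addr, i32 32,
                                        <4 x i1> <i1 false, i1 false, i1 false, i1 true>)
  ret void
}

declare void @llvm.masked.store.v4f64.p0(<4 x double>, ptr, i32 immarg, <4 x i1>)
```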
On second thought, we're going to need that exact function for the sibling patch for masked loads. I suppose it's good form to keep it as a lambda for now, but I think it's going to get pulled back out soon enough.