Currently, Clang does not generate individual stores when only some elements of a vector are updated. For the code below:
typedef float v4sf __attribute__ ((vector_size(16)));

void foo(v4sf *a) {
  (*a)[0] = 1;
  (*a)[3] = 2;
}
LLVM generates a shuffle instr for it, even if only one element is updated, whereas GCC generates individual stores (at least on PowerPC).
Also, if we have a chain of shufflevector/insertelement instrs, we can walk through it, track the status of each element to find which ones were updated, and finally replace the original vector store with multiple element stores. This patch does that.
This optimization happens in DAGCombiner, since each platform can easily set rules about turning it on in its own version of the hook method. The steps of the optimization are:
- Start at a vector store, go up through its value operand, until we find a load.
- On the path from the store to the load, we only accept insert/shuffle as operands.
- Track value modifications from the load to the store. Quit if we need to extract from other vectors.
- Generate stores of the elements changed along the path, to replace the original vector store.
A target-specific hook, isCheapToSplitStore, is added, so for now only the PowerPC target turns the optimization on.
Discussion: http://lists.llvm.org/pipermail/llvm-dev/2019-September/135432.html http://lists.llvm.org/pipermail/llvm-dev/2019-October/135638.html
This should be more descriptive. Perhaps:
Furthermore, I don't think this query is useful as implemented. For most targets, it is almost guaranteed that this should return false when NumSplit > 2, and quite likely even NumSplit == 2 is not cheaper than a single vector store.
The problem is that there is no context to determine what we would be saving if we were to split this up. If we have some sequence of operations on a vector and then need to store that vector, either with a single vector store or split into NumSplit pieces, the answer is clearly: don't split it (one store is better than many).