This fixes a subtle issue that was introduced in r205153 [0].
When reusing a store for the extractelement expansion (to load directly from it, instead of going through the stack), later stores to the same location might have overwritten the data we were expecting to extract from.

To fix that, we need to explicitly replace the chain going out of the reused store, so that later stores also depend on the generated element-extracting loads, and can no longer clobber the data before we load it.
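Concretely, the shape of the fix in SelectionDAGLegalize::ExpandExtractFromVectorThroughStack is roughly the sketch below (simplified, not the verbatim patch; Ch is the chain of the reused whole-vector store, StackPtr its address, and VecVT the vector type, all assumed from the surrounding code):

    // Load the requested element directly from the memory the reused
    // store wrote to.
    SDValue NewLoad =
        DAG.getExtLoad(ISD::EXTLOAD, dl, Op.getValueType(), Ch, StackPtr,
                       MachinePointerInfo(), VecVT.getVectorElementType());

    // Replace the chain going out of the store by the one going out of the
    // load: anything that was ordered after the store (in particular, later
    // stores to the same location) is now also ordered after the load, so
    // it can no longer clobber the value before we've extracted it.
    DAG.ReplaceAllUsesOfValueWith(Ch, SDValue(NewLoad.getNode(), 1));

    // That replacement also rewrote NewLoad's own incoming chain, creating
    // a cycle; restore the store's chain as the load's incoming chain.
    SmallVector<SDValue, 6> NewLoadOperands(NewLoad->op_begin(),
                                            NewLoad->op_end());
    NewLoadOperands[0] = Ch;
    NewLoad = SDValue(
        DAG.UpdateNodeOperands(NewLoad.getNode(), NewLoadOperands), 0);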
Here's a small annotated testcase, convoluted as it may be (any simpler and we'd be able to realize this is just a vector shuffle):
    #include <stdint.h>

    typedef int32_t v4i32 __attribute__ ((vector_size (16)));

    void test(int32_t *x, int32_t *y, int i) {
      v4i32 b = (*(v4i32*)y);
      x[i*4 + 1] = b[i*4 + 0];
      x[i*4 + 0] = b[i*4 + 1];
    }
Let's assume i is dynamically always 0. Consider the case where x == y. The function then does something like the swap pattern:
    b = *x;
    x[1] = b[0];
    x[0] = b[1];
The compiler used to generate:
    movl    8(%ebp), %eax       ; eax <- x
    movl    16(%ebp), %ecx      ; ecx <- i
    movl    12(%ebp), %edx      ; edx <- y
    movaps  (%edx), %xmm0       ; b <- *y
    movaps  %xmm0, -40(%ebp)    ; tmp = b
    shll    $4, %ecx            ; ecx <- i * 16 (byte offset of b[i*4]); ecx = 0 since i = 0
    leal    -40(%ebp), %edx
    movl    (%ecx,%edx), %edx   ; edx <- tmp[0]
    movl    %edx, 4(%eax,%ecx)  ; x[1] = tmp[0] = b[0]
    movaps  %xmm0, -24(%ebp)    ; tmp2 = b
    leal    -24(%ebp), %edx
    movl    4(%ecx,%edx), %edx  ; edx <- tmp2[1]
    movl    %edx, (%eax,%ecx)   ; x[0] = tmp2[1] = b[1]
The new code, however, is sometimes problematic:
    movl    8(%ebp), %eax       ; eax <- x
    movl    16(%ebp), %ecx      ; ecx <- i
    movl    12(%ebp), %edx      ; edx <- y
    movaps  (%edx), %xmm0       ; b <- *y
    movaps  %xmm0, -24(%ebp)    ; tmp = b
    shll    $4, %ecx            ; ecx <- i * 16 (byte offset of b[i*4]); ecx = 0 since i = 0
    leal    -24(%ebp), %edx
    movl    (%ecx,%edx), %esi   ; esi <- tmp[0]
    movl    %esi, 4(%eax,%ecx)  ; x[1] = tmp[0]
    movl    4(%ecx,%edx), %edx  ; edx <- tmp[0]
    movl    %edx, (%eax,%ecx)   ; x[0] = tmp[0]
Broadcasting x[0]: since x == y, tmp[0] is the original value of x[0], so instead of the swap, both x[0] and x[1] end up holding it.
[0] Quoting r205153's commit message:
    Make use of previously generated stores in
    SelectionDAGLegalize::ExpandExtractFromVectorThroughStack

    When expanding EXTRACT_VECTOR_ELT and EXTRACT_SUBVECTOR using
    SelectionDAGLegalize::ExpandExtractFromVectorThroughStack, we store
    the entire vector and then load the piece we want. This is fine in
    isolation, but generating a new store (and corresponding stack slot)
    for each extraction ends up producing code of poor quality.

    When we scalarize a vector operation (using
    SelectionDAG::UnrollVectorOp for example) we generate one
    EXTRACT_VECTOR_ELT for each element in the vector. This used to
    generate one stored copy of the vector for each element in the
    vector. Now we search the uses of the vector for a suitable store
    before generating a new one, which results in much more efficient
    scalarization code.
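For context, the store-reuse search described in that message sits at the top of ExpandExtractFromVectorThroughStack; a rough sketch follows (Vec, DAG, and dl are assumed context, and the names and guards follow current LLVM sources rather than the original r205153 patch):

    // Sketch: scan the vector's users for a plain, whole-vector store
    // whose destination provably hasn't been written to since, and reuse
    // it instead of creating a fresh stack temporary.
    SDValue StackPtr, Ch;
    for (SDNode *User : Vec.getNode()->uses()) {
      if (StoreSDNode *ST = dyn_cast<StoreSDNode>(User)) {
        // Only a simple, non-truncating store of the full vector qualifies.
        if (ST->isIndexed() || ST->isTruncatingStore() ||
            ST->getValue() != Vec)
          continue;
        // Make sure nothing else could have stored into the destination
        // of this store.
        if (!ST->getChain().reachesChainWithoutSideEffects(
                DAG.getEntryNode()))
          continue;
        StackPtr = ST->getBasePtr();
        Ch = SDValue(ST, 0); // reuse this store's chain for the element load
        break;
      }
    }
    if (!Ch.getNode()) {
      // No reusable store: spill the vector to a fresh stack temporary.
      StackPtr = DAG.CreateStackTemporary(Vec.getValueType());
      Ch = DAG.getStore(DAG.getEntryNode(), dl, Vec, StackPtr,
                        MachinePointerInfo());
    }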