This patch teaches (ARM|AArch64)ISelLowering.cpp to match illegal vector types to interleaved access intrinsics, as long as the types are multiples of the vector register width (128 bits). A "wide" access will now be mapped to multiple interleave intrinsics, similar to the way non-interleaved accesses with illegal types are legalized into multiple accesses. For example, given an interleaved access whose sub-vectors are 256 bits wide, the patch generates two consecutive interleaved memory accesses.
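To make the rule above concrete, here is a minimal sketch of the width check and the resulting access count. The names (kVectorRegisterBits, isWideInterleavedWidth, numInterleavedAccesses) are hypothetical and chosen for illustration; they are not taken from the patch, and the real lowering code works on LLVM types rather than raw bit counts.

```cpp
// Sketch only: hypothetical helpers, not the patch's actual code.
// Widths are sub-vector sizes in bits.
constexpr unsigned kVectorRegisterBits = 128;

// True if a sub-vector of this width can be covered by a whole number
// of 128-bit interleaved accesses.
constexpr bool isWideInterleavedWidth(unsigned subVectorBits) {
  return subVectorBits != 0 && subVectorBits % kVectorRegisterBits == 0;
}

// Number of register-sized interleaved accesses needed for one wide
// access. Assumes isWideInterleavedWidth(subVectorBits) holds.
constexpr unsigned numInterleavedAccesses(unsigned subVectorBits) {
  return subVectorBits / kVectorRegisterBits;
}
```

Under these assumptions, a 256-bit sub-vector yields two accesses, matching the example above, while a 192-bit sub-vector is rejected because it is not a multiple of 128 bits.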
The primary motivation is the vectorization of "mixed-type" loops, such as the one shown below.
  void f(char *A, int *B, unsigned N) {
    for (unsigned i = 0; i < N; i += 3) {
      B[i + 0] = A[i + 0];
      B[i + 1] = A[i + 1];
      B[i + 2] = A[i + 2];
    }
  }
Here, we load char data (i8) and then store it as int data (i32). We'd like to set the loop vectorization factor based on the smaller type rather than the larger one (this is possible today with the -vectorizer-maximize-bandwidth flag). Let the vectorization factor be 16 in this case, so that the loaded sub-vectors are <16 x i8>. The stored sub-vectors then become <16 x i32> (512 bits), which is wider than is legal. If these were stride-one accesses, this would be fine: type legalization would split them up. But for the interleaved accesses we have here, we currently can't map what the vectorizer generates to the proper interleave intrinsics because the type is too wide. Please see the test cases for more concrete examples.
I'll update the associated TTI costs (in getInterleavedMemoryOpCost) as a follow-on patch.