This patch uses the feature added in D79162 to fix the cost of a sext/zext of a masked load, or a trunc for a masked store.

Previously, those casts were considered cheap or even free, but that is absolutely not the case if the cast's result type doesn't fit in a 128-bit register. They're expensive!

Examples:

The cast fits in a 128-bit register:

    // LLVM
    define dso_local arm_aapcs_vfpcc <8 x i16> @square(<8 x i8>*, <8 x i8>, <8 x i8>) #0 {
      %mask = trunc <8 x i8> %1 to <8 x i1>
      %res = call <8 x i8> @llvm.masked.load.v8i8.p0v8i8 (<8 x i8>* %0, i32 4, <8 x i1> %mask, <8 x i8> %2)
      %ext = sext <8 x i8> %res to <8 x i16>
      ret <8 x i16> %ext
    }
    // ASM
    vpt.i32 ne, q0, zr
    vldrbt.s16 q0, [r0]
    vmovlb.s8 q1, q1
    vpsel q0, q0, q1

The cast doesn't fit in a 128-bit register:

    // LLVM
    define dso_local arm_aapcs_vfpcc <8 x i32> @square(<8 x i8>*, <8 x i8>, <8 x i8>) #0 {
      %mask = trunc <8 x i8> %1 to <8 x i1>
      %res = call <8 x i8> @llvm.masked.load.v8i8.p0v8i8 (<8 x i8>* %0, i32 4, <8 x i1> %mask, <8 x i8> %2)
      %ext = sext <8 x i8> %res to <8 x i32>
      ret <8 x i32> %ext
    }
    // ASM
    vpt.i32 ne, q0, zr
    vldrbt.u16 q0, [r0]
    vpsel q1, q0, q1
    vmov.u16 r0, q1[0]
    vmov.32 q0[0], r0
    vmov.u16 r0, q1[1]
    vmov.32 q0[1], r0
    vmov.u16 r0, q1[2]
    vmov.32 q0[2], r0
    vmov.u16 r0, q1[3]
    vmov.32 q0[3], r0
    vmov.u16 r0, q1[4]
    vmov.32 q2[0], r0
    vmov.u16 r0, q1[5]
    vmov.32 q2[1], r0
    vmov.u16 r0, q1[6]
    vmov.32 q2[2], r0
    vmov.u16 r0, q1[7]
    vmov.32 q2[3], r0
    vmovlb.s8 q0, q0
    vmovlb.s8 q1, q2
    vmovlb.s16 q0, q0
    vmovlb.s16 q1, q1

I've updated the costs to better reflect reality, and added a test for it in `test/Analysis/CostModel/ARM/cast.ll`.
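For readers skimming the summary, the distinction between the two examples above can be sketched as a simple width check. This is only an illustration, not the code in this patch: `VecType`, `MVERegisterBits` and `castResultFitsInRegister` are hypothetical names made up for the sketch, and the actual costs returned by the cost model are not reproduced here.

```cpp
// Hypothetical stand-in for a fixed-width vector type; this is not an LLVM class.
struct VecType {
  unsigned NumElts; // number of lanes
  unsigned EltBits; // width of each lane, in bits
};

// Width of the vector register the examples above refer to.
constexpr unsigned MVERegisterBits = 128;

// True when the result of the cast fits in a single 128-bit register,
// i.e. the case where the extend can be folded into the masked load.
constexpr bool castResultFitsInRegister(VecType Result) {
  return Result.NumElts * Result.EltBits <= MVERegisterBits;
}

// First example: sext to <8 x i16> is 128 bits wide, so it fits.
static_assert(castResultFitsInRegister({8, 16}), "<8 x i16> should fit");
// Second example: sext to <8 x i32> is 256 bits wide, so it does not fit.
static_assert(!castResultFitsInRegister({8, 32}), "<8 x i32> should not fit");
```

In this sketch, only the second case is the one that should be reported as expensive; how that case is actually detected depends on D79162.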

I've also added a vectorizer test that showcases the improvement: in some cases, the vectorizer will now choose a smaller VF when tail-predication is enabled, which results in better codegen. (If it were to use a higher VF in those cases, the code shown above would be generated, and the vmovs would block tail-predication later in the process, resulting in very poor codegen overall.)

Please note that the contents of this patch are subject to change depending on the outcome of the review of D79162, but the cost calculation logic shouldn't change too much (just the way this case is detected).