This adds the cost of an i1 extract and a branch to the cost computed in getMemInstScalarizationCost when the instruction is predicated. Scalarizing these predicated loads/stores generates, for each lane, a block like:
  %c1 = extractelement <4 x i1> %C, i32 1
  br i1 %c1, label %if, label %else
if:
  %sa = extractelement <4 x i32> %a, i32 1
  %sb = getelementptr inbounds float, float* %pg, i32 %sa
  %sv = extractelement <4 x float> %x, i32 1
  store float %sv, float* %sb, align 4
else:
So this increases the cost by that of the extract and the branch. This is probably still too low in many cases, given the cost of all that branching, but there is already an existing hack, useEmulatedMaskMemRefHack, that increases the cost if the instruction is a load or if there is more than one store. This patch improves the cost for the remaining case, where there is only a single store, which improves the attached test.
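As a rough illustration of the adjustment, here is a minimal standalone sketch. It is not the code in getMemInstScalarizationCost: the LaneCosts struct, the scalarizedMemCost function, and the unit costs are hypothetical, and charging the extract and branch once per lane is an assumption made to mirror the per-lane block shown above.

```cpp
// Hypothetical per-lane unit costs; in LLVM these would come from the
// target's cost model rather than being fixed constants.
struct LaneCosts {
  unsigned ScalarMemOp; // scalar load/store plus its address/value extracts
  unsigned ExtractI1;   // extractelement on the <VF x i1> mask
  unsigned Branch;      // conditional branch on the extracted mask bit
};

// Sketch of a scalarization cost for a memory instruction at vector
// factor VF. When the instruction is predicated, each lane is assumed to
// also pay for one i1 extract and one branch.
unsigned scalarizedMemCost(unsigned VF, const LaneCosts &C, bool Predicated) {
  unsigned Cost = VF * C.ScalarMemOp;
  if (Predicated)
    Cost += VF * (C.ExtractI1 + C.Branch);
  return Cost;
}
```

With these made-up unit costs, a predicated store at VF = 4 with ScalarMemOp = 3, ExtractI1 = 1 and Branch = 1 comes out at 20 instead of 12.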
(The hack can hopefully be removed at some point. From the ARM testing I tried, I think it would be OK to remove, but a number of X86 tests suggest it is still needed, and the cost of an i1 extract + branch is probably too low to accurately represent the cost of performing all these branches.)
Nit: I see there's precedent of not doing it in this file, but
would have helped readability for me.