For the attached test case, currently llvm generates instructions to load/or/store the bytes one by one. Although NEON doesn't support v4i8 natively, we can mix v8i8 alu instructions with 32 bit load/store instructions to handle them, just like x86 backend. So this patch does the same thing as in x86, enable custom STORE of v4i8, and finally I can get vectorized instructions.
Details
Diff Detail
Event Timeline
Hello. This looks like a nice idea. We have done some work to make v4i8 better in the recent past. I hadn't realized that the slp vectorizer wasn't making use of that for stores.
From what I can tell it looks like this is trying to use setOperationAction(ISD::STORE, MVT::v4i8, Custom) as way to get TargetTrasformInfo::getStoreMinimumVF to allow starting vectorization at v8i4 stores. We already make the truncstore custom in setTruncStoreAction(MVT::v4i16, MVT::v4i8, Custom) further down the file. Can we use that in getStoreMinimumVF instead? Either have it use isTruncStoreLegalOrCustom as opposed to isTruncStoreLegal, or override it in an AArch64 version of the method?
I tried isTruncStoreLegalOrCustom and it works, generates the same code as my patch. The final code is
ldr s0, [x1] ldr s1, [x0] ushll v0.8h, v0.8b, #0 // * ushll v1.8h, v1.8b, #0 // * orr v0.8b, v1.8b, v0.8b xtn v0.8b, v0.8h // * str s0, [x0] ret
But I still prefer my patch. In the final code you can see some redundant instructions, these are generated by SelectionDAG, which promotes v4i8 to v4i16, processes the data, and truncs it to v4i8 before store. If we widen the vector v4i8 to v8i8, all the extra instructions can be deleted. I will work on it in a separate patch.
Hi - sorry I must have misunderstood your last comment. I was under the impression that you were going to go and look into the other optimizations.
I don't think it is a good idea to change SDAG code generation (setOperationAction) in order to change the SLP cost model (getStoreMinimumVF). The setOperationAction isn't really valid on its own, considering we don't have a custom way of storing v4i8. It would be better to modify the costmodel directly through getStoreMinimumVF, either through the generic version or an AArch64 override.
X86 backend also set store of v4i8 as custom. We have similar capability for v4i8.
I can also add a real custom way of lowering store of v4i8 as following. It's more simple and natural than storing of v4i16.
t2: i32 = bitcast t1:v4i8 t3: ch = store<store (s32) into %somewhere> t10, t2, address
But v4i8 is not a legal type, so we can't see a store v4i8 dag node, so it looks not necessary.
I believe that X86 will treat vector lanes differently to Arm/AArch64. For smaller types the vector will be widened by adding more elements (v4i8 [a,b,c,d] will become v8i8 [a,b,c,d,u,u,u,u]) as opposed to being promoted under aarch64 to larger sizes (v4i8 is promoted to v4i16, with the top half of each lane unused). They can both have their advantages and disadvantages. With SVE having t/b instructions the promotion can make more sense, and it is good to keep SVE and NEON inline. The same is true under MVE which only has 128bit vectors so more types are promoted, but this plays nicely into the how the t/b instructions operate.
I think it is worth separating the cost model controls for the SLP vectorizer and the codegen issues of the produced code.
Thanks. LGTM
What is t/b instructions?
They are top/bottom instructions. They can do things like trunc and insert into the bottom half of each lane, to extend the top half into a full lane.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h | ||
---|---|---|
402 ↗ | (On Diff #508778) | Slight formatting on the VF >=4 -> VF >= 4, but maybe just use VF == 4 as I believe all other power2 sizes will already be handled? |
Why with SVE t/b instructions the promotion of v4i8 -> v4i16 make more sense than the widen of v4i8 -> v8i8?
In my understanding the widen of v4i8 -> v8i8 usually is a nop, but the promotion of v4i8 -> v4i16 needs a real instruction.