This is an archive of the discontinued LLVM Phabricator instance.

[AARCH64] Enable STORE of v4i8 to help more vectorization opportunities
ClosedPublic

Authored by Carrot on Mar 8 2023, 2:02 PM.

Details

Summary

For the attached test case, LLVM currently generates instructions that load, or, and store the bytes one by one. Although NEON doesn't support v4i8 natively, we can mix v8i8 ALU instructions with 32-bit load/store instructions to handle it, just like the x86 backend does. So this patch does the same thing as x86: it enables custom STORE of v4i8, and with that the test case gets vectorized instructions.
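
A minimal sketch of the enabling line described above, assuming it sits with the other vector-type setup in the AArch64TargetLowering constructor (illustrative, not the verbatim diff):

// Sketch: mark STORE of the (illegal) v4i8 type as Custom so a 4 x i8 store
// can be treated as a single 32-bit store rather than four byte stores.
// Placement and surrounding code are assumptions.
setOperationAction(ISD::STORE, MVT::v4i8, Custom);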

Diff Detail

Event Timeline

Carrot created this revision.Mar 8 2023, 2:02 PM
Herald added a project: Restricted Project.Mar 8 2023, 2:02 PM
Carrot requested review of this revision.Mar 8 2023, 2:02 PM

Hello. This looks like a nice idea. We have done some work to make v4i8 better in the recent past. I hadn't realized that the SLP vectorizer wasn't making use of that for stores.

From what I can tell, it looks like this is trying to use setOperationAction(ISD::STORE, MVT::v4i8, Custom) as a way to get TargetTransformInfo::getStoreMinimumVF to allow starting vectorization at v4i8 stores. We already make the truncstore custom in setTruncStoreAction(MVT::v4i16, MVT::v4i8, Custom) further down the file. Can we use that in getStoreMinimumVF instead? Either have it use isTruncStoreLegalOrCustom as opposed to isTruncStoreLegal, or override it in an AArch64 version of the method?
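
A minimal sketch of the first option (simplified; the real check lives inside the generic getStoreMinimumVF in BasicTTIImpl.h, and the helper name here is hypothetical):

// Hypothetical helper showing the one-line idea: when deciding how small a
// store VF may profitably be, also accept truncating stores the target marks
// Custom (such as AArch64's v4i16 -> v4i8 truncstore), not only Legal ones.
static bool isNarrowTruncStoreSupported(const TargetLoweringBase &TLI,
                                        EVT ValVT, EVT MemVT) {
  return TLI.isTruncStoreLegalOrCustom(ValVT, MemVT); // was: isTruncStoreLegal
}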

I tried isTruncStoreLegalOrCustom and it works; it generates the same code as my patch. The final code is:

ldr s0, [x1]
ldr s1, [x0]
ushll   v0.8h, v0.8b, #0       // * redundant: promote v4i8 -> v4i16
ushll   v1.8h, v1.8b, #0       // * redundant: promote v4i8 -> v4i16
orr v0.8b, v1.8b, v0.8b
xtn v0.8b, v0.8h               // * redundant: truncate back to v4i8 before the store
str s0, [x0]
ret

But I still prefer my patch. In the final code you can see some redundant instructions (the ones marked with // *); these are generated by SelectionDAG, which promotes v4i8 to v4i16, processes the data, and truncates it back to v4i8 before the store. If we instead widen v4i8 to v8i8, all the extra instructions can be deleted. I will work on that in a separate patch.

Hi - sorry, I must have misunderstood your last comment. I was under the impression that you were going to go and look into the other optimizations.

I don't think it is a good idea to change SDAG code generation (setOperationAction) in order to change the SLP cost model (getStoreMinimumVF). The setOperationAction isn't really valid on its own, considering we don't have a custom way of storing v4i8. It would be better to modify the cost model directly through getStoreMinimumVF, either through the generic version or an AArch64 override.

Allen added a subscriber: Allen.Mar 22 2023, 4:34 AM

The X86 backend also sets store of v4i8 as custom; we have a similar capability for v4i8.

I can also add a real custom way of lowering a store of v4i8, as follows. It's simpler and more natural than storing v4i16.

t2: i32 = bitcast t1:v4i8
t3: ch = store<store (s32) into %somewhere> t10, t2, address

But v4i8 is not a legal type, so we never see a v4i8 store DAG node, so it doesn't look necessary.
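
For illustration, such a custom lowering could look roughly like the sketch below (the function name is hypothetical, and as noted this node never actually appears because v4i8 is not a legal type):

// Hypothetical sketch: reinterpret the 4 x i8 value as one 32-bit integer
// and emit a single scalar store, mirroring the bitcast + store DAG above.
static SDValue lowerV4I8Store(StoreSDNode *ST, SelectionDAG &DAG) {
  SDLoc DL(ST);
  SDValue AsI32 = DAG.getNode(ISD::BITCAST, DL, MVT::i32, ST->getValue());
  return DAG.getStore(ST->getChain(), DL, AsI32, ST->getBasePtr(),
                      ST->getMemOperand());
}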

I believe that X86 will treat vector lanes differently from Arm/AArch64. For smaller types the vector will be widened by adding more elements (v4i8 [a,b,c,d] will become v8i8 [a,b,c,d,u,u,u,u]), as opposed to being promoted under AArch64 to larger lane sizes (v4i8 is promoted to v4i16, with the top half of each lane unused). They can both have their advantages and disadvantages. With SVE having t/b instructions the promotion can make more sense, and it is good to keep SVE and NEON in line. The same is true under MVE, which only has 128-bit vectors, so more types are promoted, but this plays nicely into how the t/b instructions operate.

I think it is worth separating the cost-model controls for the SLP vectorizer from the codegen issues in the produced code.

Carrot updated this revision to Diff 508778.Mar 27 2023, 1:21 PM

Override getStoreMinimumVF to support vectorization of v4i8 stores for AArch64.
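
In rough terms, the override looks like the sketch below in AArch64TargetTransformInfo.h (the exact condition is a reconstruction from the review comments, not the verbatim diff; the inline comment further down refers to its VF >= 4 check):

// Sketch of the AArch64 override: let the SLP vectorizer start i8 stores at
// VF 4, since a v4i8 store can be emitted as a single 32-bit str.
unsigned getStoreMinimumVF(unsigned VF, Type *ScalarMemTy,
                           Type *ScalarValTy) const {
  if (ScalarMemTy->isIntegerTy(8) && VF >= 4)
    return 4;
  return BaseT::getStoreMinimumVF(VF, ScalarMemTy, ScalarValTy);
}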

What are t/b instructions?

dmgreen accepted this revision.Mar 28 2023, 2:04 AM

Thanks. LGTM

What are t/b instructions?

They are top/bottom instructions. They can do things like truncate and insert into the bottom half of each lane, or extend the top half of each lane into a full lane.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
402

Slight formatting nit on the VF >=4 -> VF >= 4, but maybe just use VF == 4, as I believe all other power-of-2 sizes will already be handled?

This revision is now accepted and ready to land.Mar 28 2023, 2:04 AM
Matt added a subscriber: Matt.Mar 28 2023, 12:30 PM
Carrot updated this revision to Diff 509477.Mar 29 2023, 2:15 PM
Carrot marked an inline comment as done.

Thanks for the review! Will commit this version.

Carrot added a comment.Apr 5 2023, 9:57 AM

Why, with SVE t/b instructions, does the promotion of v4i8 -> v4i16 make more sense than widening v4i8 -> v8i8?

In my understanding, the widening of v4i8 -> v8i8 is usually a nop, but the promotion of v4i8 -> v4i16 needs a real instruction.