This commit adds support for scalable vector types in theComplexDeinterleaving
pass, allowing it to recognize and handle llvm.vector.interleave2 and
llvm.vector.deinterleave2 intrinsics for both fixed and scalable vectors
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/test/CodeGen/AArch64/complex-deinterleaving-f16-add.ll | ||
---|---|---|
102 | I am not sure if I should duplicate all fixed-width vector tests to reflect the fact that llvm.experimental.vector.deinterleave2 is now supported there. |
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp | ||
---|---|---|
259–260 | Might be worth adding a comment here saying what a "Deinterleave" is; It's not immediately clear that a deinterleave is either a specific intrinsic, or a shufflevector instruction. | |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
24641–24644 | When working with scalable vectors, they don't have the same restriction of bit width. Treating them with a max width of 128 bits seems wasteful and inefficient, is there any way to get the vector width at compile time (is there a target->getMaxVectorWidth() or something)? | |
llvm/test/CodeGen/AArch64/complex-deinterleaving-f16-add.ll | ||
102 | It can't hurt, more coverage is always good. On the other hand, this is probably enough to test that the intrinsics work with fixed width cases. | |
103–108 | Is the top comment correct? Is this case actually expected to not transform? |
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp | ||
---|---|---|
352 | Does this need to run twice with and without IsScalable? It doubles the scanning of instructions, and seems unnecessary if it only modifies the shuffles/intrinsic matches, which are either both valid or mutually exclusive. |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
24641–24644 | For the scalable vectors I don't think we want to use a min or max vector width, we should rather operate on the ElementCount and size of the ElementTypes I think. in that case maybe it is worth to have dedicated section for fixed width and scalable width vectors in this function? |
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp | ||
---|---|---|
23 | nit: typo | |
25 | nit: probably does not need to start with capital letter, maybe using "shufflevector instruction" would be more clear | |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
24628 | I think it could be worth to add extra run lines to the existing aarch64 tests which operate on fixed width vectors, to make sure that adding +sve does not stop generation of fcmla there. What do you think? |
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp | ||
---|---|---|
352 | I thought about it, but theoretically target could have only one (scalable or fixed vector) support. | |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
24628 | Added to complex-deinterleaving-f16-add.ll where we have both shufflevectors and deinterleave2 intrinsics applied to fixed-width vectors | |
24641–24644 | @NickGuy Actually, this functions returns false if VTyWidth is less than 128 bit, so any 128+ bit sized vectors are supported. @mgabka We support any unpacked type with size 2X if 2X >= 128, there is code in AArch64TargetLowering::createComplexDeinterleavingIR that splits those vectors until they have minimal size of 128 and then merges them back. We don't support min-64bit sized vectors (unlike Neon) and that's the condition I added to the if statement. | |
llvm/test/CodeGen/AArch64/complex-deinterleaving-f16-add.ll | ||
103–108 | Yes, indeed. Also fixed in the other places. |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
24641–24644 | Not sure why I added the comment here, it was supposed to be on the if (TyWidth > 128) { below, oops... How resource-efficient is this splitting with scalable vectors though, my concern is that we'd split the operation across numerous 256+ vectors while only using the lower 128 bits of each. |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
24641–24644 | Not sure if I understood your concern correctly. We are splitting any 256+ min-sized vector instructions to those that have minimum 128 bits. How many bits are going to be there depends on actual CPU, but generated code would work just fine even without knowing that information. For example, for <vscale x 8 x double> we'll get 4 instructions working on vectors of size <vscale x 2 x double> |
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp | ||
---|---|---|
352 | I'm not sure why that matters. Can you explain more? I would expect that we just match starting from whatever we find (shuffle or intrinsic) and isComplexDeinterleavingOperationSupported handles whether the actual type is supported (be it scalable or fixed length). The shuffle won't ever match a scalable vector, but that shouldn't be a problem as far as I understand. |
Make function evaluateBasicBlock to run once by moving scalable/fixed width support check to isComplexDeinterleavingOperationSupported.
Some minor clean ups.
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp | ||
---|---|---|
352 | Nevermind, I think you are right. Now evaluateBasicBlock runs only once an the check is isComplexDeinterleavingOperationSupported |
Thanks. I have some minor suggestions but otherwise LGTM.
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp | ||
---|---|---|
263 | instruciton -> instruction | |
341 | I don't think this needs to call with a Scalable flag. The AArch64TargetLowering::isComplexDeinterleavingSupported routine can just return true if it has sve or complexnums. | |
923 | This could be turned into if (!RealShuffle || !ImagShuffle) return nullptr; LLVM often likes returning early to reduce indentation. |
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp | ||
---|---|---|
341 | The isComplexDeinterleavingSupported with a flag is used inside isComplexDeinterleavingOperationSupported to check if for given vector type (scalable or fixed width) we have required architecture extensions avaialble. However I realized that there is probably a bug here, as the SVE feature enables only scalable FCMLA, while scalable CMLA are available only when SVE2 feature is enabled, so I think we should make sure that isComplexDeinterleavingOperationSupported takes that into account and add extra testing. I guess what David suggest is to make "isComplexDeinterleavingSupported" function as generic as possible and leave all the detailed checks to isComplexDeinterleavingOperationSupported , what looks reasonable to me. |
llvm/lib/CodeGen/ComplexDeinterleavingPass.cpp | ||
---|---|---|
341 | CMLA are not a problem - we are checking that scalar type is not Int in AArch64TargetLowering::isComplexDeinterleavingOperationSupported |
nit: typo