This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][CodeGen] Lower (de)interleave2 intrinsics to ld2/st2

Authored by huntergr on Mar 16 2023, 4:53 AM.



The InterleavedAccess pass currently matches (de)interleaving
shufflevector instructions with loads or stores, and calls into
target lowering to generate ldN or stN instructions.

Since we can't use shufflevector for scalable vectors (besides a
splat with zeroinitializer), we have interleave2 and deinterleave2
intrinsics. This patch extends InterleavedAccess to recognize those
intrinsics and if possible replace them with ld2/st2 via target lowering.
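As a rough illustration of the intrinsic semantics involved (a toy model in plain C++, not LLVM code, with made-up names): deinterleave2 splits an interleaved vector into its even and odd lanes, which is exactly the pair of results a single ld2 instruction produces.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy model of llvm.vector.deinterleave2: split an interleaved vector
// [a0, b0, a1, b1, ...] into its even lanes [a0, a1, ...] and odd
// lanes [b0, b1, ...] -- the two results an ld2 instruction produces.
static std::pair<std::vector<int>, std::vector<int>>
deinterleave2(const std::vector<int> &V) {
  std::vector<int> Even, Odd;
  for (std::size_t I = 0; I < V.size(); I += 2) {
    Even.push_back(V[I]);
    Odd.push_back(V[I + 1]);
  }
  return {Even, Odd};
}
```

The interleave2 intrinsic (lowered to st2) is simply the inverse operation, zipping the two lane vectors back together.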

Unlike the fixed-length version, we currently cannot 'legalize' the
operation in IR because we don't have a way of concatenating or
splitting vectors at a scalable point, so for now we just bail
out if the types won't match the actual hardware instructions.

Diff Detail

Event Timeline

huntergr created this revision.Mar 16 2023, 4:53 AM
Herald added a project: Restricted Project.Mar 16 2023, 4:53 AM

huntergr requested review of this revision.Mar 16 2023, 4:53 AM

Probably best to also check for isSimple() to match the behaviour of lowerInterleavedLoad.


Probably best to also check for isSimple() to match the behaviour of lowerInterleavedStore.


I'm happy with incremental bring-up, but I would like to see fixed-length vectors supported sooner rather than later so that we have the option to use the new shuffle intrinsics without losing this feature. I'm specifically interested in the potential to simplify SVE VLS so that it mirrors SVE VLA.


This will have the effect of "moving" the load to where the deinterleaving is happening, won't it? That has the potential to break the original IR's load/store ordering.


This is not sufficient because it'll allow <vscale x 128 x i1>, <vscale x 64 x i2> etc. Perhaps it's better to explicitly check for the types we do support?


As above, this might create the new "store" in the wrong place.

161 ↗(On Diff #505768)

Please can you add tests for ptr vectors as well.

Matt added a subscriber: Matt.Mar 25 2023, 1:25 PM
huntergr updated this revision to Diff 515641.Apr 21 2023, 1:41 AM
  • Stricter checking before trying to replace load/store+intrinsic pairs
  • Supports fixed-length vectors
  • Will now 'legalize' (de)interleave operations on larger (power-of-two) vectors into multiple ld2/st2 operations with appropriate insert/extract subvector operations.
  • More tests.
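The legalization step in the third bullet can be modelled in miniature (again plain C++ with invented names, not the patch's code): deinterleaving a double-width vector is equivalent to deinterleaving each half separately and concatenating the partial results, which is what emitting multiple ld2 operations plus insert/extract-subvector operations achieves.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy model of the legalization: one wide deinterleave equals a
// deinterleave of each half (one "ld2" per half), with the partial
// even/odd results concatenated in order.
static std::pair<std::vector<int>, std::vector<int>>
deinterleaveByHalves(const std::vector<int> &V) {
  std::vector<int> Even, Odd;
  std::size_t Half = V.size() / 2;
  for (std::size_t Base : {std::size_t(0), Half}) // one "ld2" per half
    for (std::size_t I = Base; I < Base + Half; I += 2) {
      Even.push_back(V[I]);
      Odd.push_back(V[I + 1]);
    }
  return {Even, Odd};
}
```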
huntergr marked 7 inline comments as done.Apr 21 2023, 1:43 AM
mgabka added a subscriber: mgabka.Apr 24 2023, 1:50 AM
paulwalker-arm added inline comments.Jun 1 2023, 9:34 AM

Function names should start with a lower case letter?


As above.


Perhaps this is better done in isLegalInterleavedAccessType, as the type cannot really be considered legal if the necessary target features are not available? You likely also want to use hasSVEorSME, because SME supports these instructions as well.


I suspect this patch will ICE the compiler when faced with SVE fixed-length vectors, and there are no tests to prove otherwise. The nuance here is that the result vectors are fixed length but the target intrinsics will generate scalable vectors.

If you look at lowerInterleavedLoad you'll see the extra complexity relating to working with a container type.

With that said, I'm happy for this patch to not support SVE fixed-length vectors; it just cannot trigger an ICE. This might be as simple as adding a bail-out for UseScalable != isa<ScalableVectorType>(VTy) and adding a few tests.


I think just ConstantInt::getTrue(LdTy->getContext()) should work here?


Same comment as with lowerDeinterleaveIntrinsicToLoad.


I don't really see what you're gaining by passing in Address rather than just passing in SI? I guess similar is true for lowerDeinterleaveIntrinsicToLoad.



18–19 ↗(On Diff #515641)

How about returning %deinterleave instead of the floating extractvalues, or do they exist to test something specific?

190–192 ↗(On Diff #515641)

I think it's better for NEON to have its own test file. Likewise for SVE fixed length support, which I suspect doesn't work.

huntergr updated this revision to Diff 529536.Jun 8 2023, 1:57 AM

Made recommended changes to the lowering code.

Split the test file into scalable and fixed test files, added a RUN line to the fixed test file that uses -force-streaming-compatible-sve in order to test fixed SVE. The actual transformation doesn't occur yet but we have coverage now.

huntergr marked 8 inline comments as done.Jun 8 2023, 2:00 AM

I thought opaque pointers meant we didn't need to bitcast pointers anymore? I mainly mention it because I wasn't sure what type the following CreateGEP call returns and so figured we might either not need any bitcasting or we might need a little more :)


Is this correct? VTy is the result type for each of the two results from the deinterleave intrinsic. LdTy is this divided by the number of calls to ld2 that you'll need, which represents the result type for each of the two results from the ld2 intrinsic. However, ld2 reads 2 * sizeof(LdTy) bytes, so I think Offset is half the amount it needs to be.

I wonder if you can simplify the addressing logic by just using LdTy directly. That way the offset can just be I * Factor for both fixed and scalable vectors, thus no need for vscale. What do you think?

I've not looked but I'm assuming the store function has the same problem.
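A small arithmetic sketch of the point being made here (plain C++, hypothetical names, not the patch's code): each ldN consumes Factor contiguous chunks of LdTy, so the I-th load must start I * Factor chunks past the base pointer rather than I chunks.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch of the addressing logic under discussion: the
// I-th ld2 reads Factor * sizeof(LdTy) bytes, so its byte offset from
// the base pointer is I * Factor * LdTySizeBytes. Using only
// I * LdTySizeBytes (as in the version being reviewed) gives half the
// required stride.
static std::size_t loadByteOffset(std::size_t I, std::size_t Factor,
                                  std::size_t LdTySizeBytes) {
  return I * Factor * LdTySizeBytes;
}
```

Stepping in whole LdTy units like this also works identically for fixed and scalable vectors, which is the simplification suggested above.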

15222 says that for LLVM 17 "Typed pointers are not supported", so I think the bitcasts can be removed?

huntergr updated this revision to Diff 532846.Jun 20 2023, 3:15 AM
  • Removed pointer casts. I based the new code a little too closely on the existing shufflevector lowering code.
  • Fixed offset calculation to include the interleave Factor.
  • Also fixed another bug: BaseAddr was overwritten with the new GEP value, which compounded the offsets of later memory operations instead of stepping by the per-load/store size. All GEPs now use the original pointer as the base.
huntergr marked 3 inline comments as done.Jun 20 2023, 3:17 AM
paulwalker-arm accepted this revision.Jun 23 2023, 4:40 AM

A few recommendations but otherwise looks good.


Not sure there's a perfect way to write this but I think to be a little more accurate you want something like:

if (!VecTy->isScalableTy() && !Subtarget->hasNEON())
  return false;

if (VecTy->isScalableTy() && !Subtarget->hasSVEorSME())
  return false;

Probably worth an assert(Factor >= 2 && Factor <= 4);. Same goes for getStructuredStoreFunction.


Up to you but given you've got an IRBuilder you could use Builder.getInt64() to reduce the line wrapping for all the places where you're using ConstantInt::get(Type::getInt64Ty(VTy->getContext()),.....


Is this code still relevant? It looks like leftovers from before you started to pass SI as an operand.

This revision is now accepted and ready to land.Jun 23 2023, 4:40 AM
mgabka added inline comments.Jun 26 2023, 3:07 AM

Perhaps adding a comment to this and the other function below would be a good idea, at least to be consistent with the other lowerInterleaved* functions.


Could you explain why you are checking it here but not for lowerDeinterleaveIntrinsic?


I think it is worth adding a comment here explaining why we restrict this operation to VF=2 only.


Is it correct to set it to true here, or only if the condition below is true?

From what I can see, UseScalable was previously only set to true if this function was returning true.


Is "-force-streaming-compatible-sve" needed here? I thought this transformation should always happen for SVE; the tests for scalable vectorization only have "target-features"="+sve" added. Am I missing something?

huntergr marked 7 inline comments as done.Jun 26 2023, 6:41 AM
huntergr added inline comments.

For deinterleave, I check that the load has a single use so that I can replace both it and the intrinsic; the deinterleave intrinsic itself (or a target-specific intrinsic which produces the same result) can then have multiple uses without problems.

Here, we want to make sure that the store is the only use of the interleave so that both can be replaced.


It's fine -- if the type is not a legal shuffle type, no transformation will be performed.


This flag allows us to explicitly force the use of SVE ldN/stN instructions for fixed-width vectors. If you look at AArch64TargetLowering::isLegalInterleavedAccessType(), you will see that it calls Subtarget->forceStreamingCompatibleSVE() as one of the conditions for setting UseScalable = true;

I think the other way to do it would be to set the minimum SVE vector size to 256b, but that only works if the vector types used in the test are >128b. So using the flag means I can force it for all fixed-width tests.

This revision was landed with ongoing or failed builds.Jun 26 2023, 6:41 AM
This revision was automatically updated to reflect the committed changes.
huntergr marked 2 inline comments as done.
mgabka added inline comments.Jun 26 2023, 6:50 AM

That is true if callers correctly check the return value of this function. However, UseScalable is already set to false even when the function returns false; my point is that this function should modify the state of UseScalable only if it returns true. But that is just my personal preference, I guess.

paulwalker-arm added inline comments.Jun 26 2023, 7:00 AM

The first part is the function's design (i.e. UseScalable only contains meaningful data when the function returns true). That said, this function is not great, as it tries to do two different jobs. Certainly worth a redesign if there's a chance to refactor the current way fixed-length vectors are implemented.

huntergr added inline comments.Jun 26 2023, 7:20 AM

I agree with Paul that it conflates two jobs, but I did just think of a nicer way to do it -- return either a pair of bools or an optional<bool> instead of passing in a reference to the variable. I can make that a fixup commit if you'd like.
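A minimal sketch of that alternative (hypothetical signature and stubbed inputs, not the committed code): folding the legality result and the UseScalable flag into a single std::optional<bool>, so the flag simply has no value when the type is illegal.

```cpp
#include <optional>

// Hypothetical refactor: instead of returning bool and writing through
// a UseScalable out-parameter, return std::nullopt for an illegal type
// and the use-scalable decision otherwise.
static std::optional<bool>
isLegalInterleavedAccessType(bool TypeIsLegal, bool NeedsScalable) {
  if (!TypeIsLegal)
    return std::nullopt;  // Illegal: no UseScalable value to report.
  return NeedsScalable;   // Legal: true => SVE ld2/st2, false => NEON.
}
```

This removes the possibility of a caller reading UseScalable without first checking the return value, which was the concern raised above.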