This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Match interleaved memory accesses into ldN/stN instructions.
Abandoned · Public

Authored by HaoLiu on Jun 9 2015, 2:13 AM.

Details

Summary

Hi,

The Loop Vectorizer can identify and generate interleaved memory accesses in the middle end.

E.g. for (i = 0; i < N; i+=2) {
       a = A[i];         // load of even element
       b = A[i+1];       // load of odd element
       ...               // operations on a, b, c, d
       A[i] = c;         // store of even element
       A[i+1] = d;       // store of odd element
     }
  
  The loads of even and odd elements are identified as an interleaved load group, which will be transformed into
     %wide.vec = load <8 x i32>, <8 x i32>* %ptr
     %vec.even = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
     %vec.odd = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
  
  The stores of even and odd elements are identified as an interleaved store group, which will be transformed into
     %interleaved.vec = shufflevector <4 x i32> %vec.even, <4 x i32> %vec.odd, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
     store <8 x i32> %interleaved.vec, <8 x i32>* %ptr

This patch tries to identify such interleaved loads/stores and match them into AArch64 ldN/stN intrinsics in the backend. Such intrinsics can be directly selected into ldN/stN instructions in the CodeGen phase.
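For the load group above, the matched result would look roughly like this (a sketch; the intrinsic mangling is written by analogy with the ld2.v8i8 example later in this thread, not copied from the patch):

     %ptr.cast = bitcast <8 x i32>* %ptr to <4 x i32>*
     %ldN = call { <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld2.v4i32.p0v4i32(<4 x i32>* %ptr.cast)
     %vec.even = extractvalue { <4 x i32>, <4 x i32> } %ldN, 0
     %vec.odd = extractvalue { <4 x i32>, <4 x i32> } %ldN, 1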
A new function pass, AArch64InterleavedAccess, is added. I didn't implement this functionality in the CodeGen phase, because:

The DAG node VECTOR_SHUFFLE requires the result type to be the same as the operand type. One shufflevector IR instruction may be transformed into several DAG nodes, and the mask is changed as well, so it's more difficult to identify such a pattern.
Another reason is that CodeGen can only see one basic block.

This pass is disabled by default, as the Loop Vectorizer currently disables the optimization of interleaved accesses by default. We can enable both in the future after they are fully tested.

Also, I have a similar patch for the ARM backend. I'll send it out if this patch is approved; otherwise it is unnecessary to review that patch.

Please review.

Thanks,
-Hao

Diff Detail

Event Timeline

HaoLiu updated this revision to Diff 27363. Jun 9 2015, 2:13 AM
HaoLiu retitled this revision to [AArch64] Match interleaved memory accesses into ldN/stN instructions..
HaoLiu updated this object.
HaoLiu edited the test plan for this revision.
HaoLiu added reviewers: t.p.northover, rengolin, ab.
HaoLiu added a subscriber: Unknown Object (MLST).
HaoLiu updated this revision to Diff 27364. Jun 9 2015, 2:18 AM

Updated the patch to fix a typo.

jmolloy requested changes to this revision. Jun 10 2015, 3:56 AM
jmolloy added a reviewer: jmolloy.
jmolloy added a subscriber: jmolloy.

Hi Hao,

I have some comments below. I think it would have been nicer if we'd managed to get this into CodeGen, but I do see the difficulties there. It's hard enough to match at the IR level!

My larger concern is how we're going to do this for ARM32 too without duplicating the entire pass... If there were somewhere we could put a templated version of this pass that ARM could use too, that would be great. I'm not a fan of copy-paste on this scale.

Cheers,

James

lib/Target/AArch64/AArch64InterleavedAccess.cpp
61

This would probably be easier simply as two const ints. Enum is overkill.

162

I would expect you to check here whether SVI is in DeadInsts, as we may have identified it earlier. If you don't, I'd expect we'd create multiple LDN instructions unnecessarily!
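Something like this at the top of the function, for example (just a sketch; I'm assuming DeadInsts is a set-like container with count()):

// Skip shuffles we have already folded into an ldN for this load;
// otherwise we would emit a second, redundant ldN.
if (DeadInsts.count(SVI))
  return false;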

173

The order of these checks doesn't seem the most efficient to me. You have a load that has at least one ShuffleVector using it. You're checking that the ShuffleVector mask is valid first, and only then checking the Load's other users.

I'd expect the code would be simpler if:

  1. Check the illegal vector type size.
  2. Check the load only had ShuffleVector users.
  3. Check each ShuffleVector is a strided index.

I guess what I think is strange is that this function takes a ShuffleVector as its argument, rather than a LoadInst.
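In rough pseudocode, I'd have expected something shaped like this (a sketch only; the helper names are placeholders, not what's in the patch):

static bool matchInterleavedLoad(LoadInst *LI) {
  // 1. The loaded type must be a vector of a size the pass can handle.
  VectorType *VecTy = dyn_cast<VectorType>(LI->getType());
  if (!VecTy || !hasLegalInterleavedSize(VecTy))
    return false;

  // 2. The load must only be used by shufflevectors.
  SmallVector<ShuffleVectorInst *, 4> Shuffles;
  for (User *U : LI->users()) {
    auto *SVI = dyn_cast<ShuffleVectorInst>(U);
    if (!SVI)
      return false;
    Shuffles.push_back(SVI);
  }

  // 3. Each shufflevector must be a strided (de-interleaving) mask.
  for (ShuffleVectorInst *SVI : Shuffles)
    if (!isStridedMask(SVI))
      return false;

  // ...only then build the ldN call and mark the old instructions dead.
  return true;
}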

237

This is weird - surely you should just put it into DeadInsts?

251

Please move this static function up to where all the other static functions are.

I got confused initially how this function was different to isStridedMask above. The difference is that this function checks for a mask that RE-interleaves, whereas isStridedMask checks for a mask that DE-interleaves. Please make that more clear in the function name / comments.
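To illustrate the difference for factor 2: isStridedMask matches the de-interleaving masks <0, 2, 4, 6> / <1, 3, 5, 7>, while the store side needs the re-interleaving mask <0, 4, 1, 5, 2, 6, 3, 7>. Roughly (a sketch, not the code in the patch):

// A mask re-interleaves Factor vectors of NumSubElts elements if position
// (i * Factor + j) selects element i of operand j, i.e. value j * NumSubElts + i.
static bool isReInterleaveMask(ArrayRef<int> Mask, unsigned Factor,
                               unsigned NumSubElts) {
  if (Mask.size() != Factor * NumSubElts)
    return false;
  for (unsigned i = 0; i < NumSubElts; i++)
    for (unsigned j = 0; j < Factor; j++)
      if (Mask[i * Factor + j] != (int)(j * NumSubElts + i))
        return false;
  return true;
}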

lib/Target/AArch64/AArch64TargetTransformInfo.cpp
416

Why does a ld2 have a different cost to a ld3 here?

This revision now requires changes to proceed. Jun 10 2015, 3:56 AM
HaoLiu edited edge metadata.

Hi James,

Thanks for your comments. I've uploaded a new patch.

Regarding your concern: the ARM32 patch is similar to this one. I agree with you that we'd better share the code, but I couldn't find a good place for sharing such backend code.

Thanks,
-Hao

lib/Target/AArch64/AArch64InterleavedAccess.cpp
61

Reasonable.

162

Reasonable.

173

I've updated the code.

The order is different: I first check 2, and then check 1 and 3 (they are all about a single shufflevector).

Using a ShuffleVector as the argument is meant to make this pass more efficient; a program usually has many more loads/stores than shufflevectors.

237

Done.

251

Renamed and updated the comments.

lib/Target/AArch64/AArch64TargetTransformInfo.cpp
416

This is just a rough cost.

The A57 optimisation guide says ldN instructions with different N and types have different costs, ranging from 6 to 11. But generally a larger N means a higher cost, so I use the factor as the cost.

I think it needs tuning in the future, but it works well enough for now. The cost for N > 4 below also needs tuning.

jmolloy accepted this revision. Jun 11 2015, 1:02 AM
jmolloy edited edge metadata.

Hi Hao,

Thanks for making those changes. With the few typographical comments below corrected, this LGTM.

Thanks!

James

lib/Target/AArch64/AArch64InterleavedAccess.cpp
2

Something weird's gone on with this comment line. It's not in the standard form.

30

You use the word "transfer" but actually the right word is "transform". Could you please find-and-replace?

53

We don't use #defines. Suggest you use a static const int MAX_FACTOR = 2; instead.

196

consists -> consisting

lib/Target/AArch64/AArch64TargetMachine.cpp
72

in -> "in the"

This revision is now accepted and ready to land. Jun 11 2015, 1:02 AM
rengolin edited edge metadata. Jun 11 2015, 3:21 AM

> I have some comments below. I think it would have been nicer if we'd managed to get this into CodeGen, but I do see the difficulties there. It's hard enough to match at the IR level!

There are two things we should common up: the static methods and AArch64InterleavedAccess.

The first one is easier, but the second one might be more important. I think it should be fine to add an InterleavedAccess.cpp into lib/CodeGen and have only AArch64 for now. Later, when we do ARM, you can rebase the AArch64 code into a generic class, once we know more about what the similarities are.

Meanwhile, that cpp file will contain the static methods, so when we do have an ARM version, we can easily reuse.

cheers,
--renato

Hi Hao,

Thanks for the patch. The general idea sounds good to me, but I have some questions regarding some implementation details. To address them, it would actually be helpful to see the ARM32 patch too, since I expect that a lot of the stuff can be reused, as James and Renato have already mentioned.

For instance, functions like isReInterleaveMask look totally target independent. Would it be possible to make a target-independent implementation with a bunch of target hooks (like getInterleavedStoreIntrinsic(Factor, Size))? Then, ARM32 and ARM64 patches would differ only in those hooks, while the rest of the code would be reused.
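For AArch64, such a hook could be as small as this (purely a sketch; the name and signature are just the suggestion above, not anything in the patch):

// Hypothetical hook: map the interleave factor (and element size) to the
// target's interleaved-store intrinsic, so the shared code never has to
// name st2/vst2 etc. directly.
Intrinsic::ID AArch64TargetLowering::getInterleavedStoreIntrinsic(
    unsigned Factor, unsigned Size) const {
  static const Intrinsic::ID StoreInts[3] = {Intrinsic::aarch64_neon_st2,
                                             Intrinsic::aarch64_neon_st3,
                                             Intrinsic::aarch64_neon_st4};
  assert(Factor >= 2 && Factor <= 4 && "unexpected interleave factor");
  (void)Size; // a real hook would also validate the element/vector size
  return StoreInts[Factor - 2];
}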

lib/Target/AArch64/AArch64InterleavedAccess.cpp
147–149
if (Mask.size() < 2)
  return false;

?

248

Is there a need to redeclare Index here?

280

One more redeclaration of Index.

352

Why 5?

Hi Hao,

Another comment from me as well.

Thanks,
Silviu

lib/Target/AArch64/AArch64TargetTransformInfo.cpp
415

If we get a legal type here that cannot be matched by the pass (for example v4i8), this will produce the wrong cost.
I think we also need to check that the type size is either 64 or 128.
I think we also need to check that the type size is either 64 or 128.

sbaranga added inline comments. Jun 12 2015, 8:46 AM
lib/Target/AArch64/AArch64TargetTransformInfo.cpp
415

I've looked into this some more, and it seems that the issue is that the size of VecTy is the size of the entire interleaved group. So I think the isTypeLegal check here needs to change to something like:

isTypeLegal(VectorType::get(VecTy->getScalarType(), VecTy->getVectorNumElements()/Factor));

HaoLiu edited edge metadata.

Hi,

I've refactored the patch according to comments from Michael and Silviu.

As the ARM backend code is ready, I also added the ARM code in this patch, so that we can compare and figure out how to share code between the AArch64 and ARM backends.

As you can see, most of the static functions can be shared. But the problem is that the code in matchInterleavedLoad()/matchInterleavedStore() cannot be shared. The calls to the ldN and vldN intrinsics are different:

AArch64 backend:
    %vec = tail call { <8 x i8>, <8 x i8> } @llvm.aarch64.neon.ld2.v8i8.p0v8i8(<8 x i8>* %ptr)
    call void @llvm.aarch64.neon.st2.v16i8.p0i8(<16 x i8> %A, <16 x i8> %B, i8* %ptr)
ARM backend:
    %vec = tail call { <8 x i8>, <8 x i8> } @llvm.arm.neon.vld2.v8i8(i8* %ptr, i32 1)
    call void @llvm.arm.neon.vst2.v8i8(i8* %ptr, <8 x i8> %A, <8 x i8> %B, i32 1)

The ldN and vldN intrinsics have different parameters, so it's hard to share the code that creates the intrinsic calls. We could introduce new intrinsics like llvm.neon.ldN for both AArch64 and ARM, but that means we also need to modify all the code related to the old ldN/vldN intrinsics. Do you have any ideas about this problem?
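One direction might be to keep only the intrinsic-call construction target-specific, so the rest of the pass stays shared. Roughly (a sketch with a made-up hook name, not code from either patch); the AArch64 side might look like:

CallInst *AArch64TargetLowering::createInterleavedLoad(IRBuilder<> &Builder,
                                                       Value *Ptr,
                                                       VectorType *SubVecTy,
                                                       unsigned Factor) const {
  static const Intrinsic::ID LoadInts[3] = {Intrinsic::aarch64_neon_ld2,
                                            Intrinsic::aarch64_neon_ld3,
                                            Intrinsic::aarch64_neon_ld4};
  Module *M = Builder.GetInsertBlock()->getParent()->getParent();
  Type *PtrTy = SubVecTy->getPointerTo();
  // ldN is overloaded on the member vector type and the pointer type,
  // e.g. @llvm.aarch64.neon.ld2.v8i8.p0v8i8(<8 x i8>*).
  Function *LdNFunc =
      Intrinsic::getDeclaration(M, LoadInts[Factor - 2], {SubVecTy, PtrTy});
  return Builder.CreateCall(LdNFunc, {Builder.CreateBitCast(Ptr, PtrTy)}, "ldN");
}

The ARM version would differ only in the intrinsic IDs and the extra pointer/alignment arguments.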

About the place: as Renato and James suggested, I think lib/CodeGen is a reasonable place. Maybe we could put the code in lib/CodeGen/CodeGenPrepare.cpp, which already has a function CodeGenPrepare::OptimizeShuffleVectorInst(ShuffleVectorInst *SVI); that function currently optimizes broadcast shuffles for X86.

Thanks,
-Hao

lib/Target/AArch64/AArch64InterleavedAccess.cpp
147–149

Fixed.

248

Removed the re-declaration.

280

Removed re-declaration.

352

We have at most 5 operands when creating ldN/stN intrinsic calls (the largest case is st4, which takes 4 vectors and 1 pointer as arguments).

lib/Target/AArch64/AArch64TargetTransformInfo.cpp
415

Silviu, you are right. Fixed in the new patch.
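With that fix, the relevant part of the cost hook looks roughly like this (a sketch only; the exact signature and the factor bound are assumptions, not the literal patch):

unsigned AArch64TTIImpl::getInterleavedMemoryOpCost(
    unsigned Opcode, Type *VecTy, unsigned Factor, ArrayRef<unsigned> Indices,
    unsigned Alignment, unsigned AddressSpace) {
  assert(Factor >= 2 && "Invalid interleave factor");

  // VecTy describes the whole interleaved group; ldN/stN operate on the
  // per-member subvector, so check legality on that type instead.
  Type *SubVecTy = VectorType::get(VecTy->getScalarType(),
                                   VecTy->getVectorNumElements() / Factor);
  if (Factor <= 4 && isTypeLegal(SubVecTy))
    return Factor; // rough cost: a larger N means a more expensive ldN/stN

  // Fall back to the default cost for anything the pass won't match.
  return BaseT::getInterleavedMemoryOpCost(Opcode, VecTy, Factor, Indices,
                                           Alignment, AddressSpace);
}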

ping...

Do you guys have any ideas about how to share matchInterleavedLoad()/matchInterleavedStore()?

If we cannot share those functions, I think we have to use two separate passes, ARMInterleavedAccess and AArch64InterleavedAccess, for the ARM and AArch64 backends.

Thanks,
-Hao

HaoLiu abandoned this revision. Jun 18 2015, 3:08 AM

I've created a new patch to share code between the ARM and AArch64 backends. Abandoning this patch.