This is an archive of the discontinued LLVM Phabricator instance.

[RISCV] Add a pass to recognize VLS strided loads/store from gather/scatter.
ClosedPublic

Authored by craig.topper on Aug 9 2021, 4:44 PM.

Details

Summary

For strided accesses the loop vectorizer seems to prefer creating a
vector induction variable with a start value of the form
<i32 0, i32 1, i32 2, ...>. This value will be incremented each
loop iteration by a splat constant equal to the length of the vector.
Within the loop, arithmetic using splat values will be done on this
vector induction variable to produce indices for a vector GEP.

This pass attempts to dig through the arithmetic back to the phi
to create a new scalar induction variable and a stride. We push
all of the arithmetic out of the loop by folding it into the start,
step, and stride values. Then we create a scalar GEP to use as the
base pointer for a strided load or store using the computed stride.
Loop strength reduce will run after this pass and can do some
cleanups to the scalar GEP and induction variable.

Diff Detail

Event Timeline

craig.topper created this revision.Aug 9 2021, 4:44 PM
craig.topper requested review of this revision.Aug 9 2021, 4:44 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 9 2021, 4:44 PM

I'll do a further review later when I find some time.

One thing is that I'd be interested to see is this expanded to scalable vector types. Selfishly speaking, our vectorizer will emit the scalable-vector equivalent. The only real difference, I think, is that the strided "constant" is a llvm.experimental.stepvector optionally followed by mul and possibly some other operations. How difficult do you think it would be to support this? I don't see why the llvm loop vectorizer couldn't do something similar if it doesn't already.

I could do that as a follow-up patch: I don't mean to derail this one. I just want to source your thoughts at this stage.

llvm/lib/Target/RISCV/RISCVGatherScatterLowering.cpp
83

Is there a way of sharing this with RISCVTargetTransformInfo's isLegalElementTypeForRVV?

I'll do a further review later when I find some time.

One thing is that I'd be interested to see is this expanded to scalable vector types. Selfishly speaking, our vectorizer will emit the scalable-vector equivalent. The only real difference, I think, is that the strided "constant" is a llvm.experimental.stepvector optionally followed by mul and possibly some other operations. How difficult do you think it would be to support this? I don't see why the llvm loop vectorizer couldn't do something similar if it doesn't already.

I could do that as a follow-up patch: I don't mean to derail this one. I just want to source your thoughts at this stage.

I don't think it would be too difficult to support. Splat constants on scalable vectors will be hidden behind shuffle+insertelement, but I think that's the same pattern that non-constants use for fixed vectors so I think getSplatValue should still work.

rogfer01 added inline comments.Aug 18 2021, 4:35 AM
llvm/lib/Target/RISCV/RISCVGatherScatterLowering.cpp
222

I'm curious, you require one of the values to be an instruction but I fail to see where you need that.

Other places where a (specific) instruction is needed seem correctly guarded via dyn_cast.

239

One interesting difference between fixed and scalable is that fixed vectors embed a iota vector as a constant in a vector as the loop header incoming value.

Like this:

vector.body:
  %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]                 
  %vec.ind = phi <32 x i64> [ <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7, i64 8, i64 9, i64 10, i64 11, i64 12, i64 13, i64 14, i64 15, i64 16, i64 17, i64 18, i64 19, i64 20, i64 21, i64 22, i64 23, i64 24, i64 25, i64 26, i64 27, i64 28, i64 29, i64 30, i64 31>, %entry ], [ %vec.ind.next, %vector.body ]
  %0 = mul nuw nsw <32 x i64> %vec.ind, <i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5>
  %1 = getelementptr inbounds i8, i8* %B, <32 x i64> %0

However with scalable vectorisation (see https://www.godbolt.org/z/Gchx863os ) the vector phi is gone and the iota vector (stepvector) coming from the header is used to compute the vector of indices.

vector.ph:                                        ; preds = %entry
  %4 = call <vscale x 2 x i64> @llvm.experimental.stepvector.nxv2i64(), !dbg !22
  ...
  br label %vector.body, !dbg !24
vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ], !dbg !25
  %.splatinsert11 = insertelement <vscale x 2 x i64> poison, i64 %index, i32 0, !dbg !24
  %.splat12 = shufflevector <vscale x 2 x i64> %.splatinsert11, <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer, !dbg !24
  %7 = add <vscale x 2 x i64> %.splat12, %4, !dbg !24
  %8 = mul nuw nsw <vscale x 2 x i64> %7, shufflevector (<vscale x 2 x i64> insertelement (<vscale x 2 x i64> poison, i64 5, i32 0), <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer), !dbg !27
  %9 = getelementptr inbounds i8, i8* %B, <vscale x 2 x i64> %8, !dbg !27

So at this point the algorithm needs to diverge a bit because now the phi won't be the base case (there won't be a vector phi). Instead I understand we need to determine we're splatting a scalar recurrence and combining it with a stepvector.

Not that we have to address it now. We may have to bear it in mind in the future if we plan to extend this to scalable vectors.

rogfer01 added inline comments.Aug 18 2021, 4:52 AM
llvm/lib/Target/RISCV/RISCVGatherScatterLowering.cpp
239

On a second thought, it may happen that stepvector gets optimised in a way that the vector phi is used (similar to the fixed case) so the difference goes away (being able to carry the vector of indices through the loop seems better than synthesising it fully in every iteration).

craig.topper added inline comments.Aug 18 2021, 10:17 AM
llvm/lib/Target/RISCV/RISCVGatherScatterLowering.cpp
222

I took the idea here from ARMMVEGatherScatterOpt's code to push invariant code out of the loop.

If it one operand isn't an instruction then it would have to be something loop invariant which means the whole operation is invariant. In that case we would never reach a loop phi.

239

That's interesting that the form is different. Is that because the induction variable step would need to be a splat of vscale * fixed element count that would also need to be created?

frasercrmck added inline comments.Aug 19 2021, 1:41 AM
llvm/lib/Target/RISCV/RISCVGatherScatterLowering.cpp
239

Hmm so @roger01 spurred me to actually see what we're generating and now I think I may have to bow out of the conversation.

Our vectorizer is doing something sufficiently different such that we see a scalar PHI for both fixed- and scalable vectorization. I'll post what we're doing in case it's at all useful for others, but I think I'll have to find some other solution that meets our needs if the following IR isn't something we're going to see coming out of the in-tree vectorizer(s).

Fixed:

vector_body:                        ; preds = %vector_body, %vector_ph
  %lsr.iv4 = phi i32 addrspace(1)* [ %scevgep5, %vector_body ], [ %scevgep3, %vector_phi ]

  %19 = getelementptr i32, i32 addrspace(1)* %lsr.iv4, <4 x i64> <i64 0, i64 2, i64 4, i64 6>
  %20 = tail call <4 x i32> @llvm.masked.gather.v4i32.v4p1i32(<4 x i32 addrspace(1)*> %19, i32 immarg 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> undef) #5, !noalias !28

Scalable:

vector_ph:
  %15 = tail call <vscale x 1 x i64> @llvm.experimental.stepvector.nxv1i64() #5
  %16 = shl <vscale x 1 x i64> %15, shufflevector (<vscale x 1 x i64> insertelement (<vscale x 1 x i64> undef, i64 1, i32 0), <vscale x 1 x i64> undef, <vscale x 1 x i32> zeroinitializer)
  ...

vector_body:                        ; preds = %vector_body, %vector_ph
  %lsr.iv5 = phi i32 addrspace(1)* [ %29, %vector_body ], [ %scevgep4, %vector_ph ]

  %24 = getelementptr i32, i32 addrspace(1)* %lsr.iv5, <vscale x 1 x i64> %16
  %25 = tail call <vscale x 1 x i32> @llvm.masked.gather.nxv1i32.nxv1p1i32(<vscale x 1 x i32 addrspace(1)*> %24, i32 immarg 4, <vscale x 1 x i1> shufflevector (<vscale x 1 x i1> insertelement (<vscale x 1 x i1> poison, i1 true, i32 0), <vscale x 1 x i1> poison, <vscale x 1 x i32> zeroinitializer), <vscale x 1 x i32> undef) #6, !noalias !28

So I reckon it's definitely easier to pattern match into strided accesses and is uniform across the two vector types, but that's by the by.

Move isLegalElementTypeForRVV from RISCVTargetTransformInfo to RISCVISelLowering to share it.

The TargetTransformInfoWrapperPass only gives access to a generic TargetTransformInfo
pointer. While I could cast that to RISCVTargetTransformInfo, I didn't see any
other targets doing that. So instead I just moved the function to ISelLowering
which can be accessed by this pass and by RISCVTargetTransformInfo.

frasercrmck added inline comments.Sep 10 2021, 7:54 AM
llvm/lib/Target/RISCV/RISCVGatherScatterLowering.cpp
304

I think this might be clearer as Optional<unsigned>.

344

I think this might be guaranteed by loop simplify form, but here and/or in matchStridedRecurrence I think it might be best to double-check (or bail) if the PHI does/doesn't have exactly two operands or even that the incrementing block is what we expect.

373

no need for new variable?

444

Is this comment right, or in the wrong place?

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
923

No ptrVal? We have a single scalar pointer operand so can't we use that? Or is it something to do with semantics of setting ptrVal that I'm not aware of? Just I'd have thought that size being "unknown/max" we'd prevent any bad AA.

926

I think we can use MemoryLocation::UnknownSize here now. I seem to remember a case a year or so back where IntrinsicInfo's size was unsigned. so couldn't represent the right value. That was fun.

craig.topper added inline comments.Sep 10 2021, 9:03 AM
llvm/lib/Target/RISCV/RISCVGatherScatterLowering.cpp
344

I can assert here I guess. The 2 operand check for the Phi was done by matchSimpleRecurrence

373

Must be left over from when this code was inlined in tryCreateStridedLoadStore earlier in development.

444

Leftover from an earlier version where I did that after copying it from ARM.

Address review comments

frasercrmck accepted this revision.Sep 16 2021, 8:48 AM

LGTM with one nit. I don't know if @rogfer01 wants to give it a second look?

llvm/lib/Target/RISCV/RISCVGatherScatterLowering.cpp
350

This clang-format should be addressed.

This revision is now accepted and ready to land.Sep 16 2021, 8:48 AM
This revision was landed with ongoing or failed builds.Sep 20 2021, 9:39 AM
This revision was automatically updated to reflect the committed changes.