This is an archive of the discontinued LLVM Phabricator instance.

[RISCV] Combine concat_vectors of loads into strided loads
ClosedPublic

Authored by luke on Apr 6 2023, 7:34 AM.

Details

Summary

If we're concatenating several smaller loads separated by a stride, we
can try and increase the element size and perform a strided load.
For example:

concat_vectors (load v4i8, p+0), (load v4i8, p+n), (load v4i8, p+n*2), (load v4i8, p+n*3)
=>
vlse32 p, stride=n, VL=4

This pattern can be produced by the SLP vectorizer.

A special case is when the stride is exactly equal to the width of the
vector, in which case it can be converted into a single consecutive
vector load. For example:

concat_vectors (load v4i8, p), (load v4i8, p+4), (load v4i8, p+8), (load v4i8, p+12)
=>
vle8 p, VL=16

Diff Detail

Event Timeline

luke created this revision.Apr 6 2023, 7:34 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 6 2023, 7:34 AM
luke requested review of this revision.Apr 6 2023, 7:34 AM
luke added inline comments.Apr 6 2023, 7:38 AM
llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11395

This currently only works for strides that use an incremental pattern, e.g. p, (+ p stride), (+ (+ p stride) stride), ...
A strided load could also be represented with a pointer vector built by stepvector + multiply by stride

11402

Do we need to check the memory VT as well?

11413–11425

This bit could be target agnostic. I have a copy of a patch locally that puts this in DAGCombiner if that would be a better place

llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-combine.ll
57

This test case doesn't produce a concat_vector v0, v1, v2, it's some other pattern that doesn't get picked up by the combine. Should fix it at some point

reames added a comment.Apr 6 2023, 8:46 AM

Generally looks pretty good, a couple small comments. As always, I'd like to wait for @craig.topper's feedback as he knows this part of things much better than I do.

@craig.topper To anticipate one of your objections, this does need to be a DAG combine not IR. The test cases here are IR based, but the original case I saw this (and told Luke) was due to type legalization in DAG. That particular case is now fixed by a costing change, but I suspect we have other such cases.

For follow up, a couple ideas:

  1. We don't need to match a full concat_vector. Any adjacent two element pair is profitable to fold. As a result, we can allow concat_vectors with unrelated operands.
  1. There's an inverse form of this for extract_subvector+store. That one isn't as straight forward to match, and isn't currently being emitted by SLP (or other known source). Given that, I'd suggest deferring for now.
llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11459

The size on this memory operand is wrong. The memory region accessed isn't the size of the result vector, it's the entire stride region which can be much larger. You can probably just use an unknown size here - unless we already have the utility code for this somewhere else.

llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-combine.ll
126

The comment and the code appear out of sync here. I think the code is correct.

326–431

Can you add a couple of negative tests?

  1. A case where the stride is not equal. (i.e. we recognize stride mismatch.)
  1. A case where the resulting type is not legal.
  1. A case with a non-simple load.
  1. A case where one of the operands is not a load.
luke added inline comments.Apr 6 2023, 9:00 AM
llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-combine.ll
126

Whoops, this is testing the wrong thing. The stride here should be > 16 to test the case where the combined element size > the max EEW

reames added inline comments.Apr 6 2023, 9:01 AM
llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11395

Do you have an example here? I'd expect the load pointer operand to be scalar and thus the form you describe would likely have been scalarized.

11413–11425

Sounds like a good follow up to me.

We can't increase element size without checking that the loads are aligned for the new element size.

craig.topper added inline comments.Apr 6 2023, 9:16 AM
llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11391

We need all the input loads to have the same chain input

11460

We have to call makeEquivalentMemoryOrdering for all of the loads that were combined.

craig.topper added inline comments.Apr 6 2023, 9:20 AM
llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11463

No need for break after return

luke added inline comments.Apr 6 2023, 9:47 AM
llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11395

Whoops, not a stepvector, what I meant to say was:

%p1offset = mul i64 %stride, 1
%p1 = getelementptr i8, ptr %p, i64 %p1offset

%p2offset = mul i64 %stride, 2
%p2 = getelementptr i8, ptr %p, i64 %p2offset

%p3offset = mul i64 %stride, 3
%p3 = getelementptr i8, ptr %p, i64 %p3offset

With that said I haven't seen this being emitted so far, but I'll keep an eye out to see if this or other patterns show up in the wild.

reames added a comment.Apr 6 2023, 5:44 PM

We can't increase element size without checking that the loads are aligned for the new element size.

Good catch. Check me, that requirement can be relaxed if we have fast unaligned for the access in question right?

We can't increase element size without checking that the loads are aligned for the new element size.

Good catch. Check me, that requirement can be relaxed if we have fast unaligned for the access in question right?

I think so but I don't think we have that indication for vector yet.

luke added inline comments.Apr 7 2023, 6:10 AM
llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-combine.ll
326–431

I'm struggling to think of ways to get an illegal result type.
We can produce an illegal type when we concat an irregular number of vectors so we get something like 3 x v4i16 -> v12i16, which is covered by widen_3xv4i16, but in that case we never actually do the combine in the first place since it doesn't have a concat_vector that we can match on.
strided_constant_v4i32 handles the case where we would have tried to a strided load of v2i128.

Did you have a specific example in mind?

luke updated this revision to Diff 511677.Apr 7 2023, 6:16 AM

Address review comments

luke marked 5 inline comments as done.Apr 7 2023, 6:17 AM
luke added inline comments.Apr 7 2023, 7:11 AM
llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11475–11477

Is it legal to increase the alignment here?
E.g. for these loads

%0 = load <4 x i8>, ptr %pix1, align 1
%add.ptr = getelementptr inbounds i8, ptr %pix1, i64 %idx.ext
%2 = load <4 x i8>, ptr %add.ptr, align 1

Can we use an align of 4 * 1:

%0 = call <2 x i32> @llvm.riscv.strided.load ptr %pix1, i64 %idx.ext, align 4
luke added inline comments.Apr 7 2023, 7:40 AM
llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11475–11477

I have a feeling the answer is no, which would mean that we can't combine this in x264 SAD:

c
  #include <stdint.h>
  #include <stdlib.h>
  typedef uint8_t pixel;

  #define PIXEL_SAD_C( name, lx, ly )		    \
      int name( pixel *pix1, intptr_t i_stride_pix1,  \
		pixel *pix2, intptr_t i_stride_pix2 ) \
  {                                                   \
      int i_sum = 0;                                  \
      for( int y = 0; y < ly; y++ )                   \
      {                                               \
	  for( int x = 0; x < lx; x++ )               \
	  {                                           \
	      i_sum += abs( pix1[x] - pix2[x] );      \
	  }                                           \
	  pix1 += i_stride_pix1;                      \
	  pix2 += i_stride_pix2;                      \
      }                                               \
      return i_sum;                                   \
  }

  PIXEL_SAD_C(x264_pixel_sad_4x4, 4, 4)

There's no guarantee here that pix1/pix2/i_stride_pix1/i_stride_pix2 are word aligned so we can't use vlse32. Unless we know it has fast unaligned access?

luke added inline comments.Apr 7 2023, 8:36 AM
llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-combine.ll
339–341

@reames FYI, this mirrors the loads from SLP in x264 SAD which have an alignment of 1

reames added inline comments.Apr 10 2023, 10:34 AM
llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-combine.ll
339–341

For the purpose of this patch, please update the tests to have the required alignment and add a negative test for the unaligned case. I think this case is generally useful, regardless of the outcome on the spec test.

For x266 specifically, let's talk offline.

craig.topper added inline comments.Apr 10 2023, 3:03 PM
llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11427

Do we need to call makeEquivalentMemoryOrdering on all the loads here?

luke updated this revision to Diff 512369.Apr 11 2023, 1:56 AM

Make equivalent memory ordering for all loads in vle case

luke marked an inline comment as done.Apr 11 2023, 1:57 AM
craig.topper added inline comments.Apr 11 2023, 8:27 PM
llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11366

Can we move this to a function? This is a lot of code to dump into the switch. We've been pretty sloppy about this.

11368

Drop MVT from this comment. It's redundant with "types".

11381

You probably want to check this isn't an extending load either. isNormalLoad should do it I think.

11395

We might want to find common alignment instead?

11423

This also needs allowsMemoryAccessForAlignment right?

11440

Wondering if we should always do integer in case f64 vector isn't supported, but i64 is?

11484

I'm wondering if the bitcast should come before the convertFromScalableVector. That keeps convertFromScalableVector/convertToScalableVector pairs together without having bitcasts between them.

luke updated this revision to Diff 513172.Apr 13 2023, 4:35 AM

Address review comments

luke marked 6 inline comments as done.Apr 13 2023, 4:38 AM
luke added inline comments.
llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11440

Good point, I couldn't recreate a test case for this though because we always check that VT is legal first. I also tried removing the legal type check, but then we get an assertion when calling convertFromScalableVector with an illegal type.

luke marked 4 inline comments as done.Apr 13 2023, 4:39 AM
craig.topper added inline comments.Apr 13 2023, 6:52 PM
llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11440

The case I was thinking was something like a concat of 2 v2f32 vectors which we would widen to a v2f64 strided load?

craig.topper added inline comments.Apr 13 2023, 6:53 PM
llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11440

On a Zve64f target that doesn't support Zve64d.

luke updated this revision to Diff 514628.Apr 18 2023, 6:16 AM

Add RUN line for zv32f to test that i64 loads are used even if f64 isn't supported

luke marked 3 inline comments as done.Apr 18 2023, 6:17 AM
luke added inline comments.
llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11440

Ah that makes sense. Added a RUN line for it, should be covered by @strided_runtime_4xv2f32

luke marked an inline comment as done.Apr 18 2023, 6:18 AM
This revision is now accepted and ready to land.Apr 18 2023, 4:22 PM
This revision was automatically updated to reflect the committed changes.