This is an archive of the discontinued LLVM Phabricator instance.


Differential D147713

[RISCV] Combine concat_vectors of loads into strided loads
ClosedPublic

Authored by luke on Apr 6 2023, 7:34 AM.

Details

Reviewers

craig.topper
reames

Commits

rG18dc205112df: [RISCV] Combine concat_vectors of loads into strided loads

Summary

If we're concatenating several smaller loads separated by a stride, we
can try and increase the element size and perform a strided load.
For example:

concat_vectors (load v4i8, p+0), (load v4i8, p+n), (load v4i8, p+n*2), (load v4i8, p+n*3)
=>
vlse32 p, stride=n, VL=4

This pattern can be produced by the SLP vectorizer.

A special case is when the stride is exactly equal to the width of the
vector, in which case it can be converted into a single consecutive
vector load. For example:

concat_vectors (load v4i8, p), (load v4i8, p+4), (load v4i8, p+8), (load v4i8, p+12)
=>
vle8 p, VL=16
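
The equivalence claimed in the summary can be checked with a small host-side model. This is only a sketch under stated assumptions: `concatOfLoads`, `stridedLoad32` and `sameBytes` are hypothetical helper names, not part of the patch, and the model works on plain byte buffers rather than SelectionDAG nodes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Model of: concat_vectors (load v4i8, p+0), (load v4i8, p+n), ...
// Each "load" copies 4 consecutive bytes starting at p + i * strideBytes.
std::vector<uint8_t> concatOfLoads(const uint8_t *p, std::size_t strideBytes,
                                   std::size_t numLoads) {
  std::vector<uint8_t> out;
  for (std::size_t i = 0; i < numLoads; ++i)
    out.insert(out.end(), p + i * strideBytes, p + i * strideBytes + 4);
  return out;
}

// Model of: vlse32 p, stride=n, VL=numElts
// Each element is one 32-bit value read from p + i * strideBytes.
std::vector<uint32_t> stridedLoad32(const uint8_t *p, std::size_t strideBytes,
                                    std::size_t numElts) {
  std::vector<uint32_t> out(numElts);
  for (std::size_t i = 0; i < numElts; ++i)
    std::memcpy(&out[i], p + i * strideBytes, sizeof(uint32_t));
  return out;
}

// Returns true if both forms read the same bytes in the same order.
bool sameBytes(const uint8_t *p, std::size_t strideBytes, std::size_t n) {
  assert(n > 0 && "need at least one load");
  std::vector<uint8_t> a = concatOfLoads(p, strideBytes, n);
  std::vector<uint32_t> b = stridedLoad32(p, strideBytes, n);
  return a.size() == b.size() * sizeof(uint32_t) &&
         std::memcmp(a.data(), b.data(), a.size()) == 0;
}
```

With a stride of exactly 4 bytes the same model degenerates to reading 16 consecutive bytes, which corresponds to the contiguous special case.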

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

luke created this revision. Apr 6 2023, 7:34 AM

Herald added a project: Restricted Project. Apr 6 2023, 7:34 AM

Herald added subscribers: jobnoorman, asb, pmatos and 29 others.

luke requested review of this revision. Apr 6 2023, 7:34 AM

Herald added subscribers: llvm-commits, pcwang-thead, eopXD, MaskRay.

Harbormaster completed remote builds in B224011: Diff 511412. Apr 6 2023, 7:34 AM

luke added inline comments. Apr 6 2023, 7:38 AM

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11395	This currently only works for strides that use an incremental pattern, e.g. `p, (+ p stride), (+ (+ p stride) stride), ...` A strided load could also be represented with a pointer vector built by stepvector + multiply by stride
11402	Do we need to check the memory VT as well?
11413–11425	This bit could be target agnostic. I have a copy of a patch locally that puts this in DAGCombiner if that would be a better place
llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-combine.ll
57	This test case doesn't produce a `concat_vector v0, v1, v2`, it's some other pattern that doesn't get picked up by the combine. Should fix it at some point

luke edited the summary of this revision. Apr 6 2023, 7:39 AM

luke added a parent revision: D147712: [RISCV] Add tests for concats of vectors that could become strided loads.

Generally looks pretty good, a couple small comments. As always, I'd like to wait for @craig.topper's feedback as he knows this part of things much better than I do.

@craig.topper To anticipate one of your objections, this does need to be a DAG combine not IR. The test cases here are IR based, but the original case I saw this (and told Luke) was due to type legalization in DAG. That particular case is now fixed by a costing change, but I suspect we have other such cases.

For follow up, a couple ideas:

We don't need to match a full concat_vector. Any adjacent two element pair is profitable to fold. As a result, we can allow concat_vectors with unrelated operands.

There's an inverse form of this for extract_subvector+store. That one isn't as straightforward to match, and isn't currently being emitted by SLP (or any other known source). Given that, I'd suggest deferring for now.

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11459	The size on this memory operand is wrong. The memory region accessed isn't the size of the result vector, it's the entire stride region which can be much larger. You can probably just use an unknown size here - unless we already have the utility code for this somewhere else.
llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-combine.ll
126	The comment and the code appear out of sync here. I think the code is correct.
326–431	Can you add a couple of negative tests? A case where the stride is not equal. (i.e. we recognize stride mismatch.) A case where the resulting type is not legal. A case with a non-simple load. A case where one of the operands is not a load.
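
The point about the memory operand size on line 11459 can be illustrated with a short calculation. This is a minimal sketch, assuming all quantities are in bytes and treating a negative or non-constant stride as unknown; `stridedExtentBytes` and `kUnknownExtent` are hypothetical names for illustration, not LLVM APIs.

```cpp
#include <cassert>
#include <cstdint>

// Sentinel for "extent not known at compile time" (hypothetical constant).
const uint64_t kUnknownExtent = ~static_cast<uint64_t>(0);

// Bytes spanned by n elements of elemBytes each, placed strideBytes apart:
//   (elemBytes * n) + (strideBytes - elemBytes) * (n - 1)
//   = elemBytes + strideBytes * (n - 1)
// Returns kUnknownExtent when the stride is not a known non-negative constant.
uint64_t stridedExtentBytes(uint64_t elemBytes, bool strideIsConstant,
                            int64_t strideBytes, uint64_t n) {
  assert(elemBytes > 0 && n > 0);
  if (!strideIsConstant || strideBytes < 0)
    return kUnknownExtent;
  return elemBytes + static_cast<uint64_t>(strideBytes) * (n - 1);
}
```

For four 4-byte elements with a 16-byte stride, the result vector holds 16 bytes but the access spans 4 + 16 * 3 = 52 bytes, which is why the size of the result type understates the region touched.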

luke added inline comments. Apr 6 2023, 9:00 AM

llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-combine.ll
126	Whoops, this is testing the wrong thing. The stride here should be > 16 to test the case where the combined element size > the max EEW

reames added inline comments. Apr 6 2023, 9:01 AM

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11395	Do you have an example here? I'd expect the load pointer operand to be scalar and thus the form you describe would likely have been scalarized.
11413–11425	Sounds like a good follow up to me.

We can't increase element size without checking that the loads are aligned for the new element size.

craig.topper added inline comments. Apr 6 2023, 9:16 AM

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11391	We need all the input loads to have the same chain input
11460	We have to call makeEquivalentMemoryOrdering for all of the loads that were combined.

craig.topper added inline comments. Apr 6 2023, 9:20 AM

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11463	No need for break after return

luke added inline comments. Apr 6 2023, 9:47 AM

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11395	Whoops, not a stepvector, what I meant to say was:

%p1offset = mul i64 %stride, 1
%p1 = getelementptr i8, ptr %p, i64 %p1offset
%p2offset = mul i64 %stride, 2
%p2 = getelementptr i8, ptr %p, i64 %p2offset
%p3offset = mul i64 %stride, 3
%p3 = getelementptr i8, ptr %p, i64 %p3offset

With that said I haven't seen this being emitted so far, but I'll keep an eye out to see if this or other patterns show up in the wild.
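
The two address forms discussed in this thread (incremental adds versus multiplied offsets) describe the same addresses. The sketch below models that numerically; it is not how the combine works, since the patch matches the structure of DAG nodes rather than concrete offsets. All function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Offsets built incrementally: p, p + s, (p + s) + s, ...
std::vector<int64_t> incrementalOffsets(int64_t stride, unsigned n) {
  std::vector<int64_t> offs;
  int64_t cur = 0;
  for (unsigned i = 0; i < n; ++i) {
    offs.push_back(cur);
    cur += stride;
  }
  return offs;
}

// Offsets built by multiplication: p + s * 0, p + s * 1, p + s * 2, ...
std::vector<int64_t> multipliedOffsets(int64_t stride, unsigned n) {
  std::vector<int64_t> offs;
  for (unsigned i = 0; i < n; ++i)
    offs.push_back(stride * static_cast<int64_t>(i));
  return offs;
}

// Writes the common difference to `stride` and returns true if every pair of
// consecutive offsets differs by the same amount.
bool findCommonStride(const std::vector<int64_t> &offs, int64_t &stride) {
  if (offs.size() < 2)
    return false;
  stride = offs[1] - offs[0];
  for (std::size_t i = 2; i < offs.size(); ++i)
    if (offs[i] - offs[i - 1] != stride)
      return false;
  return true;
}
```

A matcher that only recognises the incremental form would miss the multiplied form even though both yield one common stride.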

In D147713#4248991, @craig.topper wrote:

We can't increase element size without checking that the loads are aligned for the new element size.

Good catch. Check me, that requirement can be relaxed if we have fast unaligned for the access in question right?

In D147713#4250230, @reames wrote:

In D147713#4248991, @craig.topper wrote:

We can't increase element size without checking that the loads are aligned for the new element size.

Good catch. Check me, that requirement can be relaxed if we have fast unaligned for the access in question right?

I think so but I don't think we have that indication for vector yet.

luke added inline comments. Apr 7 2023, 6:10 AM

llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-combine.ll
326–431	I'm struggling to think of ways to get an illegal result type. We can produce an illegal type when we concat an irregular number of vectors so we get something like `3 x v4i16 -> v12i16`, which is covered by `widen_3xv4i16`, but in that case we never actually do the combine in the first place since it doesn't have a concat_vector that we can match on. `strided_constant_v4i32` handles the case where we would have tried to do a strided load of `v2i128`. Did you have a specific example in mind?
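
The `v2i128` case mentioned above comes down to whether the widened element fits in a supported element width. A minimal sketch, assuming the standard RVV element widths of 8, 16, 32 and 64 bits bounded by ELEN; `widenedElementBits` and `isSupportedEEW` are hypothetical helpers, not functions from the patch.

```cpp
#include <cassert>

// Width in bits of the element obtained by treating one whole sub-vector
// (numElts elements of eltBits each) as a single wider element.
unsigned widenedElementBits(unsigned eltBits, unsigned numElts) {
  return eltBits * numElts;
}

// True if `bits` is a power of two between 8 and ELEN inclusive.
bool isSupportedEEW(unsigned bits, unsigned elen) {
  assert(elen == 32 || elen == 64);
  bool isPow2 = bits != 0 && (bits & (bits - 1)) == 0;
  return isPow2 && bits >= 8 && bits <= elen;
}
```

Under these assumptions v4i8 widens to 32 bits and v4i16 to 64 bits, both usable with ELEN=64, while v4i32 would need a 128-bit element and has to be rejected.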

Address review comments

luke marked 5 inline comments as done. Apr 7 2023, 6:17 AM

Harbormaster completed remote builds in B224210: Diff 511677. Apr 7 2023, 6:57 AM

luke added inline comments. Apr 7 2023, 7:11 AM

llvm/lib/Target/RISCV/RISCVISelLowering.cpp

11475–11477

Is it legal to increase the alignment here?
E.g. for these loads

%0 = load <4 x i8>, ptr %pix1, align 1
%add.ptr = getelementptr inbounds i8, ptr %pix1, i64 %idx.ext
%2 = load <4 x i8>, ptr %add.ptr, align 1

Can we use an align of 4 * 1:

%0 = call <2 x i32> @llvm.riscv.strided.load ptr %pix1, i64 %idx.ext, align 4

luke added inline comments. Apr 7 2023, 7:40 AM

llvm/lib/Target/RISCV/RISCVISelLowering.cpp

11475–11477

I have a feeling the answer is no, which would mean that we can't combine this in x264 SAD:

  #include <stdint.h>
  #include <stdlib.h>
  typedef uint8_t pixel;

  #define PIXEL_SAD_C( name, lx, ly )		    \
      int name( pixel *pix1, intptr_t i_stride_pix1,  \
		pixel *pix2, intptr_t i_stride_pix2 ) \
  {                                                   \
      int i_sum = 0;                                  \
      for( int y = 0; y < ly; y++ )                   \
      {                                               \
	  for( int x = 0; x < lx; x++ )               \
	  {                                           \
	      i_sum += abs( pix1[x] - pix2[x] );      \
	  }                                           \
	  pix1 += i_stride_pix1;                      \
	  pix2 += i_stride_pix2;                      \
      }                                               \
      return i_sum;                                   \
  }

  PIXEL_SAD_C(x264_pixel_sad_4x4, 4, 4)

There's no guarantee here that pix1/pix2/i_stride_pix1/i_stride_pix2 are word aligned so we can't use vlse32. Unless we know it has fast unaligned access?
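
The alignment question raised in this thread can be written as a simple predicate. This is a sketch under stated assumptions: `knownAlign` stands for the common alignment recorded on every original narrow load, and `fastUnaligned` is a hypothetical flag standing in for whatever target query would report cheap misaligned vector accesses. Neither name comes from the patch.

```cpp
#include <cassert>
#include <cstdint>

bool isPowerOf2(uint64_t x) { return x != 0 && (x & (x - 1)) == 0; }

// knownAlign:    alignment in bytes shared by all original loads.
// wideEltBytes:  size in bytes of the widened element (e.g. 4 for vlse32).
// fastUnaligned: hypothetical target capability flag.
// Widening is allowed when each element is known to be sufficiently aligned,
// or when the target tolerates misaligned element accesses.
bool canWidenElements(uint64_t knownAlign, uint64_t wideEltBytes,
                      bool fastUnaligned) {
  assert(isPowerOf2(knownAlign) && isPowerOf2(wideEltBytes));
  return fastUnaligned || knownAlign >= wideEltBytes;
}
```

For the x264 SAD loads with `align 1`, widening to 32-bit elements is rejected unless the fast-unaligned assumption holds.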

luke added inline comments. Apr 7 2023, 8:36 AM

llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-combine.ll
339–341	@reames FYI, this mirrors the loads from SLP in x264 SAD which have an alignment of 1

reames added inline comments. Apr 10 2023, 10:34 AM

llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-combine.ll
339–341	For the purpose of this patch, please update the tests to have the required alignment and add a negative test for the unaligned case. I think this case is generally useful, regardless of the outcome on the spec test. For x264 specifically, let's talk offline.

craig.topper added inline comments. Apr 10 2023, 3:03 PM

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11427	Do we need to call makeEquivalentMemoryOrdering on all the loads here?

Make equivalent memory ordering for all loads in vle case

luke marked an inline comment as done. Apr 11 2023, 1:57 AM

Harbormaster completed remote builds in B224719: Diff 512369. Apr 11 2023, 2:43 AM

craig.topper added inline comments. Apr 11 2023, 8:27 PM

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11366	Can we move this to a function? This is a lot of code to dump into the switch. We've been pretty sloppy about this.
11368	Drop MVT from this comment. It's redundant with "types".
11381	You probably want to check this isn't an extending load either. isNormalLoad should do it I think.
11395	We might want to find common alignment instead?
11423	This also needs allowsMemoryAccessForAlignment right?
11440	Wondering if we should always do integer in case f64 vector isn't supported, but i64 is?
11484	I'm wondering if the bitcast should come before the convertFromScalableVector. That keeps convertFromScalableVector/convertToScalableVector pairs together without having bitcasts between them.

Address review comments

luke marked 6 inline comments as done. Apr 13 2023, 4:38 AM

luke added inline comments.

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11440	Good point, I couldn't recreate a test case for this though because we always check that `VT` is legal first. I also tried removing the legal type check, but then we get an assertion when calling `convertFromScalableVector` with an illegal type.

luke marked 4 inline comments as done. Apr 13 2023, 4:39 AM

Harbormaster completed remote builds in B225309: Diff 513172. Apr 13 2023, 5:50 AM

craig.topper added inline comments. Apr 13 2023, 6:52 PM

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11440	The case I was thinking was something like a concat of 2 v2f32 vectors which we would widen to a v2f64 strided load?

craig.topper added inline comments. Apr 13 2023, 6:53 PM

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11440	On a Zve64f target that doesn't support Zve64d.

Add RUN line for zve64f to test that i64 loads are used even if f64 isn't supported

luke marked 3 inline comments as done. Apr 18 2023, 6:17 AM

luke added inline comments.

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11440	Ah that makes sense. Added a RUN line for it, should be covered by `@strided_runtime_4xv2f32`
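
The reason an integer element type is enough for the widened load is that the load only moves bits; the float interpretation is restored by a bitcast afterwards. The host-side sketch below models that with `memcpy`. `loadPairAsU64` and `storeU64AsPair` are hypothetical names, and the example assumes 4-byte `float`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == 4, "sketch assumes 32-bit float");

// Reads two adjacent floats as one 64-bit integer, analogous to loading a
// v2f32 sub-vector as a single i64 element.
uint64_t loadPairAsU64(const float *p) {
  uint64_t bits;
  std::memcpy(&bits, p, sizeof(bits));
  return bits;
}

// Reinterprets the 64 bits as two floats again, analogous to the bitcast
// back to the original floating-point vector type.
void storeU64AsPair(uint64_t bits, float out[2]) {
  std::memcpy(out, &bits, sizeof(bits));
}
```

Because no f64 arithmetic is involved at any point, a target with 64-bit integer elements but no 64-bit floating-point elements can still perform the widened access.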

luke marked an inline comment as done. Apr 18 2023, 6:18 AM

Harbormaster completed remote builds in B226369: Diff 514628. Apr 18 2023, 7:12 AM

LGTM

This revision is now accepted and ready to land. Apr 18 2023, 4:22 PM

Closed by commit rG18dc205112df: [RISCV] Combine concat_vectors of loads into strided loads (authored by luke). Apr 19 2023, 1:37 AM

This revision was automatically updated to reflect the committed changes.

luke added a commit: rG18dc205112df: [RISCV] Combine concat_vectors of loads into strided loads.

reames mentioned this in D149375: [RISCV] Introduce unaligned-vector-mem feature. Apr 27 2023, 12:22 PM

reames mentioned this in rGd636bcb6ae51: [RISCV] Introduce unaligned-vector-mem feature. Apr 28 2023, 8:28 AM

Revision Contents

Path	Size
llvm/lib/Target/RISCV/RISCVISelLowering.cpp	123 lines
llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-combine.ll	126 lines

Diff 511677

llvm/lib/Target/RISCV/RISCVISelLowering.cpp

if (Subtarget.hasStdExtZfhOrZfhmin())		if (Subtarget.hasStdExtZfhOrZfhmin())
setTargetDAGCombine(ISD::SIGN_EXTEND_INREG);		setTargetDAGCombine(ISD::SIGN_EXTEND_INREG);
if (Subtarget.hasStdExtF())		if (Subtarget.hasStdExtF())
setTargetDAGCombine({ISD::ZERO_EXTEND, ISD::FP_TO_SINT, ISD::FP_TO_UINT,		setTargetDAGCombine({ISD::ZERO_EXTEND, ISD::FP_TO_SINT, ISD::FP_TO_UINT,
ISD::FP_TO_SINT_SAT, ISD::FP_TO_UINT_SAT});		ISD::FP_TO_SINT_SAT, ISD::FP_TO_UINT_SAT});
if (Subtarget.hasVInstructions())		if (Subtarget.hasVInstructions())
setTargetDAGCombine({ISD::FCOPYSIGN, ISD::MGATHER, ISD::MSCATTER,		setTargetDAGCombine({ISD::FCOPYSIGN, ISD::MGATHER, ISD::MSCATTER,
ISD::VP_GATHER, ISD::VP_SCATTER, ISD::SRA, ISD::SRL,		ISD::VP_GATHER, ISD::VP_SCATTER, ISD::SRA, ISD::SRL,
ISD::SHL, ISD::STORE, ISD::SPLAT_VECTOR});		ISD::SHL, ISD::STORE, ISD::SPLAT_VECTOR, ISD::CONCAT_VECTORS});
if (Subtarget.hasVendorXTHeadMemPair())		if (Subtarget.hasVendorXTHeadMemPair())
setTargetDAGCombine({ISD::LOAD, ISD::STORE});		setTargetDAGCombine({ISD::LOAD, ISD::STORE});
if (Subtarget.useRVVForFixedLengthVectors())		if (Subtarget.useRVVForFixedLengthVectors())
setTargetDAGCombine(ISD::BITCAST);		setTargetDAGCombine(ISD::BITCAST);

setLibcallName(RTLIB::FPEXT_F16_F32, "__extendhfsf2");		setLibcallName(RTLIB::FPEXT_F16_F32, "__extendhfsf2");
setLibcallName(RTLIB::FPROUND_F32_F16, "__truncsfhf2");		setLibcallName(RTLIB::FPROUND_F32_F16, "__truncsfhf2");
}		}
...
case ISD::SPLAT_VECTOR: {		case ISD::SPLAT_VECTOR: {
// Only perform this combine on legal MVT types.		// Only perform this combine on legal MVT types.
if (!isTypeLegal(VT))		if (!isTypeLegal(VT))
break;		break;
if (auto Gather = matchSplatAsGather(N->getOperand(0), VT.getSimpleVT(), N,		if (auto Gather = matchSplatAsGather(N->getOperand(0), VT.getSimpleVT(), N,
DAG, Subtarget))		DAG, Subtarget))
return Gather;		return Gather;
break;		break;
}		}
		case ISD::CONCAT_VECTORS: {
		SDLoc DL(N);
		EVT VT = N->getValueType(0);
		// Only perform this combine on legal MVT types.
		if (!isTypeLegal(VT))
		break;

		// TODO: Potentially extend this to scalable vectors
		if (VT.isScalableVector())
		break;

		// If we're concatenating a series of vector loads like
		// concat_vectors (load v4i8, p+0), (load v4i8, p+n), (load v4i8, p+n*2) ...
		// Then we can turn this into a strided load by widening the vector elements
		// vlse32 p, n
		auto *BaseLd = dyn_cast<LoadSDNode>(N->getOperand(0));
		if (!BaseLd \|\| !BaseLd->isSimple() \|\| !SDValue(BaseLd, 0).hasOneUse())
		break;

		EVT BaseLdVT = BaseLd->getValueType(0);
		SDValue BasePtr = BaseLd->getBasePtr();

		auto IsStrided = [&BaseLd, &BasePtr, &BaseLdVT, &N]() {
		SDValue Stride;
		SDValue CurPtr = BasePtr;

		for (SDValue Op : N->ops().drop_front()) {
		auto *Ld = dyn_cast<LoadSDNode>(Op);
		if (!Ld \|\| !Ld->isSimple() \|\| !Op.hasOneUse() \|\|
		Ld->getChain() != BaseLd->getChain() \|\|
		Ld->getAlign() != BaseLd->getAlign() \|\|
		Ld->getValueType(0) != BaseLdVT)
		return SDValue();

		SDValue Ptr = Ld->getBasePtr();
		// Check that each load's pointer is (add CurPtr, Stride)
		if (Ptr.getOpcode() != ISD::ADD \|\| Ptr.getOperand(0) != CurPtr)
		return SDValue();
		SDValue Offset = Ptr.getOperand(1);
		if (!Stride)
		Stride = Offset;
		else if (Offset != Stride)
		return SDValue();

		CurPtr = Ptr;
		}
		return Stride;
		};

		SDValue Stride = IsStrided();
		if (!Stride)
		break;

		// A special case is if the stride is exactly the width of one of the loads,
		// in which case it's contiguous and can be combined into a regular vle
		// without changing the element size
		if (auto *ConstStride = dyn_cast<ConstantSDNode>(Stride)) {
		if (ConstStride->getZExtValue() == BaseLdVT.getFixedSizeInBits() / 8) {
		SDValue WideLoad =
		DAG.getLoad(VT, DL, BaseLd->getChain(), BasePtr,
		DAG.getMachineFunction().getMachineMemOperand(
		BaseLd->getMemOperand(), 0, VT.getStoreSize()));
		DAG.makeEquivalentMemoryOrdering(BaseLd, WideLoad);
		return WideLoad;
		}
		}

		// Get the widened scalar type, e.g. v4i8 -> i64
		MVT WideScalarVT;
		unsigned WideScalarBitWidth =
		BaseLdVT.getScalarSizeInBits() * BaseLdVT.getVectorNumElements();
		if (BaseLdVT.isInteger())
		WideScalarVT = MVT::getIntegerVT(WideScalarBitWidth);
		else if (BaseLdVT.isFloatingPoint())
		WideScalarVT = MVT::getFloatingPointVT(WideScalarBitWidth);
		else
		break;

		// Get the vector type for the strided load, e.g. 4 x v4i8 -> v4i64
		MVT WideVecVT = MVT::getVectorVT(WideScalarVT, N->getNumOperands());
		if (!isTypeLegal(WideVecVT))
		break;

		MVT ContainerVT = getContainerForFixedLengthVector(WideVecVT);
		SDValue VL =
		getDefaultVLOps(WideVecVT, ContainerVT, DL, DAG, Subtarget).second;
		SDVTList VTs = DAG.getVTList({ContainerVT, MVT::Other});
		SDValue IntID =
		DAG.getTargetConstant(Intrinsic::riscv_vlse, DL, Subtarget.getXLenVT());
		SDValue Ops[] = {BaseLd->getChain(),
		IntID,
		DAG.getUNDEF(ContainerVT),
		BasePtr,
		Stride,
		VL};

		uint64_t MemSize;
		if (auto *ConstStride = dyn_cast<ConstantSDNode>(Stride))
		// total size = (elsize * n) + (stride - elsize) * (n-1)
		// = elsize + stride * (n-1)
		MemSize = WideScalarVT.getSizeInBits() +
		ConstStride->getSExtValue() * (N->getNumOperands() - 1);
		else
		// If Stride isn't constant, then we can't know how much it will load
		MemSize = MemoryLocation::UnknownSize;
		MachineMemOperand *MMO = DAG.getMachineFunction().getMachineMemOperand(
		BaseLd->getMemOperand(), 0, MemSize);

		// Can't do the combine if the alignment (from the old loads) isn't aligned
		// with the new element type
		if (!allowsMemoryAccessForAlignment(*DAG.getContext(), DAG.getDataLayout(),
		WideVecVT, *MMO))
		break;

		SDValue StridedLoad = DAG.getMemIntrinsicNode(ISD::INTRINSIC_W_CHAIN, DL,
		VTs, Ops, WideVecVT, MMO);
		for (SDValue Ld : N->ops())
		DAG.makeEquivalentMemoryOrdering(cast<LoadSDNode>(Ld), StridedLoad);
		return DAG.getBitcast(
		VT, convertFromScalableVector(WideVecVT, StridedLoad, DAG, Subtarget));
		}
case RISCVISD::VMV_V_X_VL: {		case RISCVISD::VMV_V_X_VL: {
// Tail agnostic VMV.V.X only demands the vector element bitwidth from the		// Tail agnostic VMV.V.X only demands the vector element bitwidth from the
// scalar input.		// scalar input.
unsigned ScalarSize = N->getOperand(1).getValueSizeInBits();		unsigned ScalarSize = N->getOperand(1).getValueSizeInBits();
unsigned EltWidth = N->getValueType(0).getScalarSizeInBits();		unsigned EltWidth = N->getValueType(0).getScalarSizeInBits();
if (ScalarSize > EltWidth && N->getOperand(0).isUndef())		if (ScalarSize > EltWidth && N->getOperand(0).isUndef())
if (SimplifyDemandedLowBitsHelper(1, EltWidth))		if (SimplifyDemandedLowBitsHelper(1, EltWidth))
return SDValue(N, 0);		return SDValue(N, 0);
...

llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-combine.ll

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 2		; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 2
; RUN: llc -mtriple=riscv32 -mattr=+v,+zfh,+experimental-zvfh -verify-machineinstrs < %s \| FileCheck %s -check-prefixes=CHECK,RV32		; RUN: llc -mtriple=riscv32 -mattr=+v,+zfh,+experimental-zvfh -verify-machineinstrs < %s \| FileCheck %s -check-prefixes=CHECK,RV32
; RUN: llc -mtriple=riscv64 -mattr=+v,+zfh,+experimental-zvfh -verify-machineinstrs < %s \| FileCheck %s -check-prefixes=CHECK,RV64		; RUN: llc -mtriple=riscv64 -mattr=+v,+zfh,+experimental-zvfh -verify-machineinstrs < %s \| FileCheck %s -check-prefixes=CHECK,RV64

; The two loads are contiguous and should be folded into one		; The two loads are contiguous and should be folded into one
define void @widen_2xv4i16(ptr %x, ptr %z) {		define void @widen_2xv4i16(ptr %x, ptr %z) {
; CHECK-LABEL: widen_2xv4i16:		; CHECK-LABEL: widen_2xv4i16:
; CHECK: # %bb.0:		; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma
; CHECK-NEXT: vle16.v v8, (a0)
; CHECK-NEXT: addi a0, a0, 8
; CHECK-NEXT: vle16.v v9, (a0)
; CHECK-NEXT: vsetivli zero, 8, e16, m1, ta, ma		; CHECK-NEXT: vsetivli zero, 8, e16, m1, ta, ma
; CHECK-NEXT: vslideup.vi v8, v9, 4		; CHECK-NEXT: vle16.v v8, (a0)
; CHECK-NEXT: vse16.v v8, (a1)		; CHECK-NEXT: vse16.v v8, (a1)
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%a = load <4 x i16>, ptr %x		%a = load <4 x i16>, ptr %x
%b.gep = getelementptr i8, ptr %x, i64 8		%b.gep = getelementptr i8, ptr %x, i64 8
%b = load <4 x i16>, ptr %b.gep		%b = load <4 x i16>, ptr %b.gep
%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
store <8 x i16> %c, ptr %z		store <8 x i16> %c, ptr %z
ret void		ret void
...
; RV64-NEXT: vsetivli zero, 12, e16, m2, tu, ma		; RV64-NEXT: vsetivli zero, 12, e16, m2, tu, ma
; RV64-NEXT: vslideup.vi v8, v12, 8		; RV64-NEXT: vslideup.vi v8, v12, 8
; RV64-NEXT: vsetivli zero, 1, e64, m2, ta, ma		; RV64-NEXT: vsetivli zero, 1, e64, m2, ta, ma
; RV64-NEXT: vslidedown.vi v10, v8, 2		; RV64-NEXT: vslidedown.vi v10, v8, 2
; RV64-NEXT: addi a0, a1, 16		; RV64-NEXT: addi a0, a1, 16
; RV64-NEXT: vse64.v v10, (a0)		; RV64-NEXT: vse64.v v10, (a0)
; RV64-NEXT: vsetivli zero, 8, e16, m1, ta, ma		; RV64-NEXT: vsetivli zero, 8, e16, m1, ta, ma
; RV64-NEXT: vse16.v v8, (a1)		; RV64-NEXT: vse16.v v8, (a1)
; RV64-NEXT: ret		; RV64-NEXT: ret
%a = load <4 x i16>, ptr %x		%a = load <4 x i16>, ptr %x
%b.gep = getelementptr i8, ptr %x, i64 8		%b.gep = getelementptr i8, ptr %x, i64 8
%b = load <4 x i16>, ptr %b.gep		%b = load <4 x i16>, ptr %b.gep
%c.gep = getelementptr i8, ptr %b.gep, i64 8		%c.gep = getelementptr i8, ptr %b.gep, i64 8
%c = load <4 x i16>, ptr %c.gep		%c = load <4 x i16>, ptr %c.gep
%d.0 = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%d.0 = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%d.1 = shufflevector <4 x i16> %c, <4 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison>		%d.1 = shufflevector <4 x i16> %c, <4 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison>
%d.2 = shufflevector <8 x i16> %d.0, <8 x i16> %d.1, <12 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11>		%d.2 = shufflevector <8 x i16> %d.0, <8 x i16> %d.1, <12 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11>
store <12 x i16> %d.2, ptr %z		store <12 x i16> %d.2, ptr %z
ret void		ret void
}		}

define void @widen_4xv4i16(ptr %x, ptr %z) {		define void @widen_4xv4i16(ptr %x, ptr %z) {
; CHECK-LABEL: widen_4xv4i16:		; CHECK-LABEL: widen_4xv4i16:
; CHECK: # %bb.0:		; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma
; CHECK-NEXT: vle16.v v8, (a0)
; CHECK-NEXT: addi a2, a0, 8
; CHECK-NEXT: vle16.v v10, (a2)
; CHECK-NEXT: addi a2, a0, 16
; CHECK-NEXT: vle16.v v12, (a2)
; CHECK-NEXT: addi a0, a0, 24
; CHECK-NEXT: vle16.v v14, (a0)
; CHECK-NEXT: vsetivli zero, 8, e16, m2, tu, ma
; CHECK-NEXT: vslideup.vi v8, v10, 4
; CHECK-NEXT: vsetivli zero, 12, e16, m2, tu, ma
; CHECK-NEXT: vslideup.vi v8, v12, 8
; CHECK-NEXT: vsetivli zero, 16, e16, m2, ta, ma		; CHECK-NEXT: vsetivli zero, 16, e16, m2, ta, ma
; CHECK-NEXT: vslideup.vi v8, v14, 12		; CHECK-NEXT: vle16.v v8, (a0)
; CHECK-NEXT: vse16.v v8, (a1)		; CHECK-NEXT: vse16.v v8, (a1)
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%a = load <4 x i16>, ptr %x		%a = load <4 x i16>, ptr %x
%b.gep = getelementptr i8, ptr %x, i64 8		%b.gep = getelementptr i8, ptr %x, i64 8
%b = load <4 x i16>, ptr %b.gep		%b = load <4 x i16>, ptr %b.gep
%c.gep = getelementptr i8, ptr %b.gep, i64 8		%c.gep = getelementptr i8, ptr %b.gep, i64 8
%c = load <4 x i16>, ptr %c.gep		%c = load <4 x i16>, ptr %c.gep
%d.gep = getelementptr i8, ptr %c.gep, i64 8		%d.gep = getelementptr i8, ptr %c.gep, i64 8
%d = load <4 x i16>, ptr %d.gep		%d = load <4 x i16>, ptr %d.gep
%e.0 = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%e.0 = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%e.1 = shufflevector <4 x i16> %c, <4 x i16> %d, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%e.1 = shufflevector <4 x i16> %c, <4 x i16> %d, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%e.2 = shufflevector <8 x i16> %e.0, <8 x i16> %e.1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>		%e.2 = shufflevector <8 x i16> %e.0, <8 x i16> %e.1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
store <16 x i16> %e.2, ptr %z		store <16 x i16> %e.2, ptr %z
ret void		ret void
}		}

; Should be a strided load - with type coercion to i64		; Should be a strided load - with type coercion to i64
define void @strided_constant(ptr %x, ptr %z) {		define void @strided_constant(ptr %x, ptr %z) {
; CHECK-LABEL: strided_constant:		; CHECK-LABEL: strided_constant:
; CHECK: # %bb.0:		; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma		; CHECK-NEXT: li a2, 16
; CHECK-NEXT: vle16.v v8, (a0)		; CHECK-NEXT: vsetivli zero, 2, e64, m1, ta, ma
; CHECK-NEXT: addi a0, a0, 16		; CHECK-NEXT: vlse64.v v8, (a0), a2
; CHECK-NEXT: vle16.v v9, (a0)		; CHECK-NEXT: vse64.v v8, (a1)
; CHECK-NEXT: vsetivli zero, 8, e16, m1, ta, ma
; CHECK-NEXT: vslideup.vi v8, v9, 4
; CHECK-NEXT: vse16.v v8, (a1)
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%a = load <4 x i16>, ptr %x		%a = load <4 x i16>, ptr %x
%b.gep = getelementptr i8, ptr %x, i64 16		%b.gep = getelementptr i8, ptr %x, i64 16
%b = load <4 x i16>, ptr %b.gep		%b = load <4 x i16>, ptr %b.gep
%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
store <8 x i16> %c, ptr %z		store <8 x i16> %c, ptr %z
ret void		ret void
}		}

; Should be a strided load		; Should be a strided load
define void @strided_constant_64(ptr %x, ptr %z) {		define void @strided_constant_64(ptr %x, ptr %z) {
; CHECK-LABEL: strided_constant_64:		; CHECK-LABEL: strided_constant_64:
; CHECK: # %bb.0:		; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma		; CHECK-NEXT: li a2, 64
; CHECK-NEXT: vle16.v v8, (a0)		; CHECK-NEXT: vsetivli zero, 2, e64, m1, ta, ma
; CHECK-NEXT: addi a0, a0, 64		; CHECK-NEXT: vlse64.v v8, (a0), a2
; CHECK-NEXT: vle16.v v9, (a0)		; CHECK-NEXT: vse64.v v8, (a1)
; CHECK-NEXT: vsetivli zero, 8, e16, m1, ta, ma
; CHECK-NEXT: vslideup.vi v8, v9, 4
; CHECK-NEXT: vse16.v v8, (a1)
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%a = load <4 x i16>, ptr %x		%a = load <4 x i16>, ptr %x
%b.gep = getelementptr i8, ptr %x, i64 64		%b.gep = getelementptr i8, ptr %x, i64 64
%b = load <4 x i16>, ptr %b.gep		%b = load <4 x i16>, ptr %b.gep
%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
store <8 x i16> %c, ptr %z		store <8 x i16> %c, ptr %z
ret void		ret void
}		}

; Vector is too large to fit into a single strided load		; Vector is too large to fit into a single strided load
define void @strided_constant_v4i32(ptr %x, ptr %z) {		define void @strided_constant_v4i32(ptr %x, ptr %z) {
		reames: The comment and the code appear out of sync here. I think the code is correct.
		luke: Whoops, this is testing the wrong thing. The stride here should be > 16 to test the case where the combined element size > the max EEW.
; CHECK-LABEL: strided_constant_v4i32:		; CHECK-LABEL: strided_constant_v4i32:
; CHECK: # %bb.0:		; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 4, e32, m1, ta, ma		; CHECK-NEXT: vsetivli zero, 4, e32, m1, ta, ma
; CHECK-NEXT: vle32.v v8, (a0)		; CHECK-NEXT: vle32.v v8, (a0)
; CHECK-NEXT: addi a0, a0, 32		; CHECK-NEXT: addi a0, a0, 32
; CHECK-NEXT: vle32.v v10, (a0)		; CHECK-NEXT: vle32.v v10, (a0)
; CHECK-NEXT: vsetivli zero, 8, e32, m2, ta, ma		; CHECK-NEXT: vsetivli zero, 8, e32, m2, ta, ma
; CHECK-NEXT: vslideup.vi v8, v10, 4		; CHECK-NEXT: vslideup.vi v8, v10, 4
[57 lines collapsed in the archived diff view]	; CHECK-NEXT: ret
%e.2 = shufflevector <8 x i16> %e.0, <8 x i16> %e.1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>		%e.2 = shufflevector <8 x i16> %e.0, <8 x i16> %e.1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
store <16 x i16> %e.2, ptr %z		store <16 x i16> %e.2, ptr %z
ret void		ret void
}		}

define void @strided_runtime(ptr %x, ptr %z, i64 %s) {		define void @strided_runtime(ptr %x, ptr %z, i64 %s) {
; CHECK-LABEL: strided_runtime:		; CHECK-LABEL: strided_runtime:
; CHECK: # %bb.0:		; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma		; CHECK-NEXT: vsetivli zero, 2, e64, m1, ta, ma
; CHECK-NEXT: vle16.v v8, (a0)		; CHECK-NEXT: vlse64.v v8, (a0), a2
; CHECK-NEXT: add a0, a0, a2		; CHECK-NEXT: vse64.v v8, (a1)
; CHECK-NEXT: vle16.v v9, (a0)
; CHECK-NEXT: vsetivli zero, 8, e16, m1, ta, ma
; CHECK-NEXT: vslideup.vi v8, v9, 4
; CHECK-NEXT: vse16.v v8, (a1)
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%a = load <4 x i16>, ptr %x		%a = load <4 x i16>, ptr %x
%b.gep = getelementptr i8, ptr %x, i64 %s		%b.gep = getelementptr i8, ptr %x, i64 %s
%b = load <4 x i16>, ptr %b.gep		%b = load <4 x i16>, ptr %b.gep
%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
store <8 x i16> %c, ptr %z		store <8 x i16> %c, ptr %z
ret void		ret void
}		}

define void @strided_runtime_4xv4i16(ptr %x, ptr %z, i64 %s) {		define void @strided_runtime_4xv4i16(ptr %x, ptr %z, i64 %s) {
; CHECK-LABEL: strided_runtime_4xv4i16:		; CHECK-LABEL: strided_runtime_4xv4i16:
; CHECK: # %bb.0:		; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma		; CHECK-NEXT: vsetivli zero, 4, e64, m2, ta, ma
; CHECK-NEXT: vle16.v v8, (a0)		; CHECK-NEXT: vlse64.v v8, (a0), a2
; CHECK-NEXT: add a0, a0, a2		; CHECK-NEXT: vse64.v v8, (a1)
; CHECK-NEXT: vle16.v v10, (a0)
; CHECK-NEXT: add a0, a0, a2
; CHECK-NEXT: vle16.v v12, (a0)
; CHECK-NEXT: add a0, a0, a2
; CHECK-NEXT: vle16.v v14, (a0)
; CHECK-NEXT: vsetivli zero, 8, e16, m2, tu, ma
; CHECK-NEXT: vslideup.vi v8, v10, 4
; CHECK-NEXT: vsetivli zero, 12, e16, m2, tu, ma
; CHECK-NEXT: vslideup.vi v8, v12, 8
; CHECK-NEXT: vsetivli zero, 16, e16, m2, ta, ma
; CHECK-NEXT: vslideup.vi v8, v14, 12
; CHECK-NEXT: vse16.v v8, (a1)
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%a = load <4 x i16>, ptr %x		%a = load <4 x i16>, ptr %x
%b.gep = getelementptr i8, ptr %x, i64 %s		%b.gep = getelementptr i8, ptr %x, i64 %s
%b = load <4 x i16>, ptr %b.gep		%b = load <4 x i16>, ptr %b.gep
%c.gep = getelementptr i8, ptr %b.gep, i64 %s		%c.gep = getelementptr i8, ptr %b.gep, i64 %s
%c = load <4 x i16>, ptr %c.gep		%c = load <4 x i16>, ptr %c.gep
%d.gep = getelementptr i8, ptr %c.gep, i64 %s		%d.gep = getelementptr i8, ptr %c.gep, i64 %s
%d = load <4 x i16>, ptr %d.gep		%d = load <4 x i16>, ptr %d.gep
[55 lines collapsed in the archived diff view]	; RV64-NEXT: ret
%e.2 = shufflevector <8 x i16> %e.0, <8 x i16> %e.1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>		%e.2 = shufflevector <8 x i16> %e.0, <8 x i16> %e.1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
store <16 x i16> %e.2, ptr %z		store <16 x i16> %e.2, ptr %z
ret void		ret void
}		}

define void @strided_runtime_4xv4f16(ptr %x, ptr %z, i64 %s) {		define void @strided_runtime_4xv4f16(ptr %x, ptr %z, i64 %s) {
; CHECK-LABEL: strided_runtime_4xv4f16:		; CHECK-LABEL: strided_runtime_4xv4f16:
; CHECK: # %bb.0:		; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma		; CHECK-NEXT: vsetivli zero, 4, e64, m2, ta, ma
; CHECK-NEXT: vle16.v v8, (a0)		; CHECK-NEXT: vlse64.v v8, (a0), a2
; CHECK-NEXT: add a0, a0, a2		; CHECK-NEXT: vse64.v v8, (a1)
; CHECK-NEXT: vle16.v v10, (a0)
; CHECK-NEXT: add a0, a0, a2
; CHECK-NEXT: vle16.v v12, (a0)
; CHECK-NEXT: add a0, a0, a2
; CHECK-NEXT: vle16.v v14, (a0)
; CHECK-NEXT: vsetivli zero, 8, e16, m2, tu, ma
; CHECK-NEXT: vslideup.vi v8, v10, 4
; CHECK-NEXT: vsetivli zero, 12, e16, m2, tu, ma
; CHECK-NEXT: vslideup.vi v8, v12, 8
; CHECK-NEXT: vsetivli zero, 16, e16, m2, ta, ma
; CHECK-NEXT: vslideup.vi v8, v14, 12
; CHECK-NEXT: vse16.v v8, (a1)
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%a = load <4 x half>, ptr %x		%a = load <4 x half>, ptr %x
%b.gep = getelementptr i8, ptr %x, i64 %s		%b.gep = getelementptr i8, ptr %x, i64 %s
%b = load <4 x half>, ptr %b.gep		%b = load <4 x half>, ptr %b.gep
%c.gep = getelementptr i8, ptr %b.gep, i64 %s		%c.gep = getelementptr i8, ptr %b.gep, i64 %s
%c = load <4 x half>, ptr %c.gep		%c = load <4 x half>, ptr %c.gep
%d.gep = getelementptr i8, ptr %c.gep, i64 %s		%d.gep = getelementptr i8, ptr %c.gep, i64 %s
%d = load <4 x half>, ptr %d.gep		%d = load <4 x half>, ptr %d.gep
%e.0 = shufflevector <4 x half> %a, <4 x half> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%e.0 = shufflevector <4 x half> %a, <4 x half> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%e.1 = shufflevector <4 x half> %c, <4 x half> %d, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%e.1 = shufflevector <4 x half> %c, <4 x half> %d, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%e.2 = shufflevector <8 x half> %e.0, <8 x half> %e.1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>		%e.2 = shufflevector <8 x half> %e.0, <8 x half> %e.1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
store <16 x half> %e.2, ptr %z		store <16 x half> %e.2, ptr %z
ret void		ret void
}		}

define void @strided_runtime_4xv2f32(ptr %x, ptr %z, i64 %s) {		define void @strided_runtime_4xv2f32(ptr %x, ptr %z, i64 %s) {
; CHECK-LABEL: strided_runtime_4xv2f32:		; CHECK-LABEL: strided_runtime_4xv2f32:
; CHECK: # %bb.0:		; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 2, e32, mf2, ta, ma		; CHECK-NEXT: vsetivli zero, 4, e64, m2, ta, ma
; CHECK-NEXT: vle32.v v8, (a0)		; CHECK-NEXT: vlse64.v v8, (a0), a2
; CHECK-NEXT: add a0, a0, a2		; CHECK-NEXT: vse64.v v8, (a1)
; CHECK-NEXT: vle32.v v10, (a0)
; CHECK-NEXT: add a0, a0, a2
; CHECK-NEXT: vle32.v v12, (a0)
; CHECK-NEXT: add a0, a0, a2
; CHECK-NEXT: vle32.v v14, (a0)
; CHECK-NEXT: vsetivli zero, 4, e32, m2, tu, ma
; CHECK-NEXT: vslideup.vi v8, v10, 2
; CHECK-NEXT: vsetivli zero, 6, e32, m2, tu, ma
; CHECK-NEXT: vslideup.vi v8, v12, 4
; CHECK-NEXT: vsetivli zero, 8, e32, m2, ta, ma
; CHECK-NEXT: vslideup.vi v8, v14, 6
; CHECK-NEXT: vse32.v v8, (a1)
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%a = load <2 x float>, ptr %x		%a = load <2 x float>, ptr %x
%b.gep = getelementptr i8, ptr %x, i64 %s		%b.gep = getelementptr i8, ptr %x, i64 %s
%b = load <2 x float>, ptr %b.gep		%b = load <2 x float>, ptr %b.gep
%c.gep = getelementptr i8, ptr %b.gep, i64 %s		%c.gep = getelementptr i8, ptr %b.gep, i64 %s
%c = load <2 x float>, ptr %c.gep		%c = load <2 x float>, ptr %c.gep
%d.gep = getelementptr i8, ptr %c.gep, i64 %s		%d.gep = getelementptr i8, ptr %c.gep, i64 %s
%d = load <2 x float>, ptr %d.gep		%d = load <2 x float>, ptr %d.gep
%e.0 = shufflevector <2 x float> %a, <2 x float> %b, <4 x i32> <i32 0, i32 1, i32 2, i32 3>		%e.0 = shufflevector <2 x float> %a, <2 x float> %b, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%e.1 = shufflevector <2 x float> %c, <2 x float> %d, <4 x i32> <i32 0, i32 1, i32 2, i32 3>		%e.1 = shufflevector <2 x float> %c, <2 x float> %d, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%e.2 = shufflevector <4 x float> %e.0, <4 x float> %e.1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%e.2 = shufflevector <4 x float> %e.0, <4 x float> %e.1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
store <8 x float> %e.2, ptr %z		store <8 x float> %e.2, ptr %z
ret void		ret void
}		}

; Shouldn't be combined because the resulting load would not be aligned		; Shouldn't be combined because the resulting load would not be aligned
define void @strided_unaligned(ptr %x, ptr %z, i64 %s) {		define void @strided_unaligned(ptr %x, ptr %z, i64 %s) {
; CHECK-LABEL: strided_unaligned:		; CHECK-LABEL: strided_unaligned:
; CHECK: # %bb.0:		; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 8, e16, m1, ta, ma		; CHECK-NEXT: vsetivli zero, 8, e16, m1, ta, ma
; CHECK-NEXT: vle8.v v8, (a0)		; CHECK-NEXT: vle8.v v8, (a0)
; CHECK-NEXT: add a0, a0, a2		; CHECK-NEXT: add a0, a0, a2
; CHECK-NEXT: vle8.v v9, (a0)		; CHECK-NEXT: vle8.v v9, (a0)
; CHECK-NEXT: vslideup.vi v8, v9, 4		; CHECK-NEXT: vslideup.vi v8, v9, 4
; CHECK-NEXT: vse16.v v8, (a1)		; CHECK-NEXT: vse16.v v8, (a1)
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%a = load <4 x i16>, ptr %x, align 1		%a = load <4 x i16>, ptr %x, align 1
%b.gep = getelementptr i8, ptr %x, i64 %s		%b.gep = getelementptr i8, ptr %x, i64 %s
%b = load <4 x i16>, ptr %b.gep, align 1		%b = load <4 x i16>, ptr %b.gep, align 1
		luke: @reames FYI, this mirrors the loads from SLP in x264 SAD, which have an alignment of 1.
		reames: For the purpose of this patch, please update the tests to have the required alignment and add a negative test for the unaligned case. I think this case is generally useful, regardless of the outcome on the spec test. For x264 specifically, let's talk offline.
%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
store <8 x i16> %c, ptr %z		store <8 x i16> %c, ptr %z
ret void		ret void
}		}

; Shouldn't be combined because the loads have different alignments		; Shouldn't be combined because the loads have different alignments
define void @strided_mismatched_alignments(ptr %x, ptr %z, i64 %s) {		define void @strided_mismatched_alignments(ptr %x, ptr %z, i64 %s) {
; CHECK-LABEL: strided_mismatched_alignments:		; CHECK-LABEL: strided_mismatched_alignments:
; CHECK: # %bb.0:		; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma		; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma
; CHECK-NEXT: vle16.v v8, (a0)		; CHECK-NEXT: vle16.v v8, (a0)
; CHECK-NEXT: add a0, a0, a2		; CHECK-NEXT: add a0, a0, a2
; CHECK-NEXT: vle16.v v9, (a0)		; CHECK-NEXT: vle16.v v9, (a0)
; CHECK-NEXT: vsetivli zero, 8, e16, m1, ta, ma		; CHECK-NEXT: vsetivli zero, 8, e16, m1, ta, ma
; CHECK-NEXT: vslideup.vi v8, v9, 4		; CHECK-NEXT: vslideup.vi v8, v9, 4
; CHECK-NEXT: vse16.v v8, (a1)		; CHECK-NEXT: vse16.v v8, (a1)
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%a = load <4 x i16>, ptr %x, align 8		%a = load <4 x i16>, ptr %x, align 8
%b.gep = getelementptr i8, ptr %x, i64 %s		%b.gep = getelementptr i8, ptr %x, i64 %s
%b = load <4 x i16>, ptr %b.gep, align 16		%b = load <4 x i16>, ptr %b.gep, align 16
%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
store <8 x i16> %c, ptr %z		store <8 x i16> %c, ptr %z
ret void		ret void
}		}

define void @strided_ok_alignments_8(ptr %x, ptr %z, i64 %s) {		define void @strided_ok_alignments_8(ptr %x, ptr %z, i64 %s) {
; CHECK-LABEL: strided_ok_alignments_8:		; CHECK-LABEL: strided_ok_alignments_8:
; CHECK: # %bb.0:		; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma		; CHECK-NEXT: vsetivli zero, 2, e64, m1, ta, ma
; CHECK-NEXT: vle16.v v8, (a0)		; CHECK-NEXT: vlse64.v v8, (a0), a2
; CHECK-NEXT: add a0, a0, a2		; CHECK-NEXT: vse64.v v8, (a1)
; CHECK-NEXT: vle16.v v9, (a0)
; CHECK-NEXT: vsetivli zero, 8, e16, m1, ta, ma
; CHECK-NEXT: vslideup.vi v8, v9, 4
; CHECK-NEXT: vse16.v v8, (a1)
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%a = load <4 x i16>, ptr %x, align 8		%a = load <4 x i16>, ptr %x, align 8
%b.gep = getelementptr i8, ptr %x, i64 %s		%b.gep = getelementptr i8, ptr %x, i64 %s
%b = load <4 x i16>, ptr %b.gep, align 8		%b = load <4 x i16>, ptr %b.gep, align 8
%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
store <8 x i16> %c, ptr %z		store <8 x i16> %c, ptr %z
ret void		ret void
}		}

define void @strided_ok_alignments_16(ptr %x, ptr %z, i64 %s) {		define void @strided_ok_alignments_16(ptr %x, ptr %z, i64 %s) {
; CHECK-LABEL: strided_ok_alignments_16:		; CHECK-LABEL: strided_ok_alignments_16:
; CHECK: # %bb.0:		; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma		; CHECK-NEXT: vsetivli zero, 2, e64, m1, ta, ma
; CHECK-NEXT: vle16.v v8, (a0)		; CHECK-NEXT: vlse64.v v8, (a0), a2
; CHECK-NEXT: add a0, a0, a2		; CHECK-NEXT: vse64.v v8, (a1)
; CHECK-NEXT: vle16.v v9, (a0)
; CHECK-NEXT: vsetivli zero, 8, e16, m1, ta, ma
; CHECK-NEXT: vslideup.vi v8, v9, 4
; CHECK-NEXT: vse16.v v8, (a1)
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%a = load <4 x i16>, ptr %x, align 16		%a = load <4 x i16>, ptr %x, align 16
%b.gep = getelementptr i8, ptr %x, i64 %s		%b.gep = getelementptr i8, ptr %x, i64 %s
%b = load <4 x i16>, ptr %b.gep, align 16		%b = load <4 x i16>, ptr %b.gep, align 16
%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
store <8 x i16> %c, ptr %z		store <8 x i16> %c, ptr %z
ret void		ret void
}		}

; Shouldn't be combined because one of the loads is not simple		; Shouldn't be combined because one of the loads is not simple
define void @strided_non_simple_load(ptr %x, ptr %z, i64 %s) {		define void @strided_non_simple_load(ptr %x, ptr %z, i64 %s) {
; CHECK-LABEL: strided_non_simple_load:		; CHECK-LABEL: strided_non_simple_load:
; CHECK: # %bb.0:		; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma		; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma
; CHECK-NEXT: vle16.v v8, (a0)		; CHECK-NEXT: vle16.v v8, (a0)
; CHECK-NEXT: add a0, a0, a2		; CHECK-NEXT: add a0, a0, a2
; CHECK-NEXT: vle16.v v9, (a0)		; CHECK-NEXT: vle16.v v9, (a0)
; CHECK-NEXT: vsetivli zero, 8, e16, m1, ta, ma		; CHECK-NEXT: vsetivli zero, 8, e16, m1, ta, ma
; CHECK-NEXT: vslideup.vi v8, v9, 4		; CHECK-NEXT: vslideup.vi v8, v9, 4
; CHECK-NEXT: vse16.v v8, (a1)		; CHECK-NEXT: vse16.v v8, (a1)
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%a = load <4 x i16>, ptr %x		%a = load <4 x i16>, ptr %x
%b.gep = getelementptr i8, ptr %x, i64 %s		%b.gep = getelementptr i8, ptr %x, i64 %s
%b = load volatile <4 x i16>, ptr %b.gep		%b = load volatile <4 x i16>, ptr %b.gep
%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
store <8 x i16> %c, ptr %z		store <8 x i16> %c, ptr %z
ret void		ret void
}		}

; Shouldn't be combined because one of the operands is not a load		; Shouldn't be combined because one of the operands is not a load
define void @strided_non_load(ptr %x, ptr %z, <4 x i16> %b) {		define void @strided_non_load(ptr %x, ptr %z, <4 x i16> %b) {
; CHECK-LABEL: strided_non_load:		; CHECK-LABEL: strided_non_load:
; CHECK: # %bb.0:		; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma		; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma
; CHECK-NEXT: vle16.v v9, (a0)		; CHECK-NEXT: vle16.v v9, (a0)
; CHECK-NEXT: vsetivli zero, 8, e16, m1, ta, ma		; CHECK-NEXT: vsetivli zero, 8, e16, m1, ta, ma
; CHECK-NEXT: vslideup.vi v9, v8, 4		; CHECK-NEXT: vslideup.vi v9, v8, 4
; CHECK-NEXT: vse16.v v9, (a1)		; CHECK-NEXT: vse16.v v9, (a1)
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%a = load <4 x i16>, ptr %x		%a = load <4 x i16>, ptr %x
%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
store <8 x i16> %c, ptr %z		store <8 x i16> %c, ptr %z
ret void		ret void
}		}
		reames: Can you add a couple of negative tests?
		1) A case where the stride is not equal (i.e. we recognize stride mismatch).
		2) A case where the resulting type is not legal.
		3) A case with a non-simple load.
		4) A case where one of the operands is not a load.
		luke: I'm struggling to think of ways to get an illegal result type. We can produce an illegal type when we concat an irregular number of vectors, so we get something like `3 x v4i16 -> v12i16`, which is covered by `widen_3xv4i16`, but in that case we never actually do the combine in the first place since it doesn't have a concat_vector that we can match on. `strided_constant_v4i32` handles the case where we would have tried to do a strided load of `v2i128`. Did you have a specific example in mind?
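
Each negative test discussed above exercises one precondition of the combine. As a rough illustration only, the standalone sketch below models those preconditions outside of LLVM; it is not the code in `RISCVISelLowering.cpp`. All names (`LoadDesc`, `WideLoad`, `tryWidenToStridedLoad`, `strideKey`, `maxEEW`) are hypothetical. The alignment rule (every load aligned to at least the widened element size) and the maximum EEW of 64 bits are assumptions inferred from the tests in this file.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical description of one operand of the concat_vectors node.
struct LoadDesc {
  bool isLoad;         // false if the operand is not a load at all
  bool isSimple;       // false for volatile or atomic loads
  unsigned numElts;    // e.g. 4 for <4 x i16>
  unsigned eltBits;    // e.g. 16 for <4 x i16>
  uint64_t alignBytes; // alignment of the load in bytes
  int64_t strideKey;   // opaque key for the offset from the previous load;
                       // equal keys mean equal strides (ignored for operand 0)
};

// Hypothetical result: one strided load of `vl` elements of `eltBits` bits.
struct WideLoad {
  unsigned eltBits;
  unsigned vl;
};

// Returns the widened strided load if every modeled precondition holds,
// otherwise std::nullopt. maxEEW = 64 is an assumption, not a queried value.
std::optional<WideLoad>
tryWidenToStridedLoad(const std::vector<LoadDesc> &ops, unsigned maxEEW = 64) {
  if (ops.size() < 2)
    return std::nullopt;
  const LoadDesc &first = ops[0];
  for (std::size_t i = 0; i < ops.size(); ++i) {
    const LoadDesc &op = ops[i];
    // Every operand must be a simple (non-volatile, non-atomic) load.
    if (!op.isLoad || !op.isSimple)
      return std::nullopt;
    // All loads must have the same vector type and the same alignment.
    if (op.numElts != first.numElts || op.eltBits != first.eltBits ||
        op.alignBytes != first.alignBytes)
      return std::nullopt;
    // Each load must sit at the same stride from the one before it.
    if (i >= 2 && op.strideKey != ops[1].strideKey)
      return std::nullopt;
  }
  // Each small vector becomes a single wide element of the strided load.
  unsigned wideBits = first.numElts * first.eltBits;
  if (wideBits > maxEEW)
    return std::nullopt; // e.g. <4 x i32> would need a 128-bit element
  // Assumed rule: the loads must be aligned to the widened element size.
  if (first.alignBytes * 8 < wideBits)
    return std::nullopt;
  return WideLoad{wideBits, static_cast<unsigned>(ops.size())};
}
```

Under this model, two 8-byte-aligned `<4 x i16>` loads widen to a 64-bit element with VL=2, matching the `e64` / VL=2 output in `strided_runtime`, while two `<4 x i32>` loads are rejected because a 128-bit element exceeds the assumed maximum EEW.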

Revision Contents

Diff 511677
