This is an archive of the discontinued LLVM Phabricator instance.

Paths

llvm/lib/Target/RISCV/RISCVISelLowering.cpp (24/24)
llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-combine.ll (7/7)

Differential D147713

[RISCV] Combine concat_vectors of loads into strided loads
Closed · Public

Authored by luke on Apr 6 2023, 7:34 AM.

Details

Reviewers

craig.topper
reames

Commits

rG18dc205112df: [RISCV] Combine concat_vectors of loads into strided loads

Summary

If we're concatenating several smaller loads separated by a stride, we
can try and increase the element size and perform a strided load.
For example:

concat_vectors (load v4i8, p+0), (load v4i8, p+n), (load v4i8, p+n*2), (load v4i8, p+n*3)
=>
vlse32 p, stride=n, VL=4

This pattern can be produced by the SLP vectorizer.
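
As a rough illustration of why the two forms in the example are interchangeable, here is a minimal standalone C++ sketch. It is not code from the patch: `concatRowLoads`, `stridedLoad32` and `sameBytes` are hypothetical names, and plain byte buffers stand in for vector registers. It models four `v4i8` loads placed `stride` bytes apart and one load of four 32-bit elements with the same stride, then compares the resulting bytes.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Models concat_vectors of `rows` loads of 4 x i8, each `stride` bytes apart.
std::vector<uint8_t> concatRowLoads(const uint8_t *p, std::size_t stride,
                                    std::size_t rows) {
  std::vector<uint8_t> out;
  for (std::size_t r = 0; r < rows; ++r)
    out.insert(out.end(), p + r * stride, p + r * stride + 4);
  return out;
}

// Models a strided load of `vl` 32-bit elements, `stride` bytes apart.
// memcpy is used so the sketch itself makes no alignment assumption.
std::vector<uint32_t> stridedLoad32(const uint8_t *p, std::size_t stride,
                                    std::size_t vl) {
  std::vector<uint32_t> out(vl);
  for (std::size_t i = 0; i < vl; ++i)
    std::memcpy(&out[i], p + i * stride, sizeof(uint32_t));
  return out;
}

// True if both forms read exactly the same bytes in the same order.
bool sameBytes(const uint8_t *p, std::size_t stride, std::size_t n) {
  std::vector<uint8_t> narrow = concatRowLoads(p, stride, n);
  std::vector<uint32_t> wide = stridedLoad32(p, stride, n);
  if (narrow.empty())
    return true;
  return std::memcmp(narrow.data(), wide.data(), narrow.size()) == 0;
}
```

The sketch only shows that the bytes match; whether the wider access is allowed on real hardware depends on alignment, which the review discusses further down.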

A special case is when the stride is exactly equal to the width of the
vector, in which case it can be converted into a single consecutive
vector load. For example:

concat_vectors (load v4i8, p), (load v4i8, p+4), (load v4i8, p+8), (load v4i8, p+12)
=>
vle8 p, VL=16
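
The special case comes down to one comparison between the constant stride and the byte width of each small load. A minimal sketch of that decision, using hypothetical names (`LoadKind`, `classifyConstantStride`, `contiguousVL`) that do not appear in the patch:

```cpp
#include <cstdint>

enum class LoadKind { Contiguous, Strided };

// vecBits: size in bits of each small load being concatenated (v4i8 -> 32).
// strideBytes: constant distance in bytes between consecutive loads.
LoadKind classifyConstantStride(uint64_t strideBytes, uint64_t vecBits) {
  return strideBytes == vecBits / 8 ? LoadKind::Contiguous : LoadKind::Strided;
}

// In the contiguous case the element type is unchanged, so the vector length
// is the total number of original elements.
uint64_t contiguousVL(uint64_t numLoads, uint64_t eltsPerLoad) {
  return numLoads * eltsPerLoad;
}
```

For the example above, `classifyConstantStride(4, 32)` yields `LoadKind::Contiguous` and `contiguousVL(4, 4)` gives the `VL=16` shown for `vle8`.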

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

luke created this revision. Apr 6 2023, 7:34 AM

Herald added a project: Restricted Project. Apr 6 2023, 7:34 AM

Herald added subscribers: jobnoorman, asb, pmatos and 29 others.

luke requested review of this revision. Apr 6 2023, 7:34 AM

Herald added a project: Restricted Project. Apr 6 2023, 7:34 AM

Herald added subscribers: llvm-commits, pcwang-thead, eopXD, MaskRay.

Harbormaster completed remote builds in B224011: Diff 511412. Apr 6 2023, 7:34 AM

luke added inline comments. Apr 6 2023, 7:38 AM

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11689	This currently only works for strides that use an incremental pattern, e.g. `p, (+ p stride), (+ (+ p stride) stride), ...` A strided load could also be represented with a pointer vector built by stepvector + multiply by stride
11696	Do we need to check the memory VT as well?
11707–11719	This bit could be target agnostic. I have a copy of a patch locally that puts this in DAGCombiner if that would be a better place
llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-combine.ll
57	This test case doesn't produce a `concat_vector v0, v1, v2`, it's some other pattern that doesn't get picked up by the combine. Should fix it at some point

luke edited the summary of this revision. Apr 6 2023, 7:39 AM

luke added a parent revision: D147712: [RISCV] Add tests for concats of vectors that could become strided loads.

Generally looks pretty good, a couple small comments. As always, I'd like to wait for @craig.topper's feedback as he knows this part of things much better than I do.

@craig.topper To anticipate one of your objections, this does need to be a DAG combine, not an IR one. The test cases here are IR based, but the original case where I saw this (and told Luke about it) was due to type legalization in the DAG. That particular case is now fixed by a costing change, but I suspect we have other such cases.

For follow up, a couple ideas:

We don't need to match a full concat_vector. Any adjacent two element pair is profitable to fold. As a result, we can allow concat_vectors with unrelated operands.

There's an inverse form of this for extract_subvector+store. That one isn't as straightforward to match, and isn't currently being emitted by SLP (or any other known source). Given that, I'd suggest deferring for now.

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11753	The size on this memory operand is wrong. The memory region accessed isn't the size of the result vector, it's the entire stride region which can be much larger. You can probably just use an unknown size here - unless we already have the utility code for this somewhere else.
llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-combine.ll
126	The comment and the code appear out of sync here. I think the code is correct.
326–427	Can you add a couple of negative tests? A case where the stride is not equal. (i.e. we recognize stride mismatch.) A case where the resulting type is not legal. A case with a non-simple load. A case where one of the operands is not a load.
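
The point about the memory operand size on line 11753 can be made concrete with a little arithmetic. The sketch below is a standalone illustration, not code from the patch. `stridedRegionBytes` is a hypothetical helper, and it assumes a non-negative constant stride with every quantity expressed in bytes.

```cpp
#include <cstdint>

// Bytes spanned by n elements of eltBytes each, placed strideBytes apart:
//   (eltBytes * n) + (strideBytes - eltBytes) * (n - 1)
//   = eltBytes + strideBytes * (n - 1)
// Assumes n >= 1 and strideBytes >= eltBytes (no overlapping elements).
uint64_t stridedRegionBytes(uint64_t eltBytes, uint64_t strideBytes,
                            uint64_t n) {
  return eltBytes + strideBytes * (n - 1);
}
```

Four 4-byte elements with a 100-byte stride touch a 304-byte region, while the result vector is only 16 bytes. With a runtime stride the span is not known at compile time, which is why an unknown size is suggested as the fallback.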

luke added inline comments. Apr 6 2023, 9:00 AM

llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-combine.ll
126	Whoops, this is testing the wrong thing. The stride here should be > 16 to test the case where the combined element size > the max EEW

reames added inline comments. Apr 6 2023, 9:01 AM

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11689	Do you have an example here? I'd expect the load pointer operand to be scalar and thus the form you describe would likely have been scalarized.
11707–11719	Sounds like a good follow up to me.

We can't increase element size without checking that the loads are aligned for the new element size.

craig.topper added inline comments. Apr 6 2023, 9:16 AM

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11685	We need all the input loads to have the same chain input
11754	We have to call makeEquivalentMemoryOrdering for all of the loads that were combined.

craig.topper added inline comments. Apr 6 2023, 9:20 AM

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11757	No need for break after return

luke added inline comments. Apr 6 2023, 9:47 AM

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11689	Whoops, not a stepvector, what I meant to say was:

%p1offset = mul i64 %stride, 1
%p1 = getelementptr i8, ptr %p, i64 %p1offset
%p2offset = mul i64 %stride, 2
%p2 = getelementptr i8, ptr %p, i64 %p2offset
%p3offset = mul i64 %stride, 3
%p3 = getelementptr i8, ptr %p, i64 %p3offset

With that said I haven't seen this being emitted so far, but I'll keep an eye out to see if this or other patterns show up in the wild.
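
To make the difference between the two address patterns concrete, here is a small standalone model. It is not the patch's code: `PtrExpr` and `isIncrementalStride` are hypothetical stand-ins for DAG nodes and for the pointer check in the combine. Each pointer is either the base or "previous pointer plus an offset value", and the matcher accepts only the incremental chain.

```cpp
#include <cstddef>
#include <vector>

// A pointer is either the base (prev == nullptr) or "prev + offset".
// offsetId identifies the value that holds the offset, so two pointers use
// the same stride only if their offsetId values are equal.
struct PtrExpr {
  const PtrExpr *prev;
  int offsetId;
};

// Returns true if ptrs[i] is exactly (add ptrs[i-1], stride) for every i >= 1,
// with one common stride value.
bool isIncrementalStride(const std::vector<const PtrExpr *> &ptrs) {
  if (ptrs.size() < 2)
    return false;
  const int stride = ptrs[1]->offsetId;
  for (std::size_t i = 1; i < ptrs.size(); ++i) {
    if (ptrs[i]->prev != ptrs[i - 1] || ptrs[i]->offsetId != stride)
      return false;
  }
  return true;
}
```

A chain `p, p+s, (p+s)+s` passes this check. The `mul`/`getelementptr` form above does not, because every pointer hangs off `%p` directly with a different offset value, so it would need a separate matcher.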

In D147713#4248991, @craig.topper wrote:

We can't increase element size without checking that the loads are aligned for the new element size.

Good catch. Check me, that requirement can be relaxed if we have fast unaligned for the access in question right?

In D147713#4250230, @reames wrote:

In D147713#4248991, @craig.topper wrote:

We can't increase element size without checking that the loads are aligned for the new element size.

Good catch. Check me, that requirement can be relaxed if we have fast unaligned for the access in question right?

I think so but I don't think we have that indication for vector yet.
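
The alignment rule discussed in this thread can be sketched as follows. This is a standalone illustration rather than the patch's implementation: `commonAlignment` and `canWidenElements` are hypothetical helpers, alignments are assumed to be powers of two expressed in bytes, and `fastUnalignedVector` is a hypothetical flag standing in for a target capability that, per the reply above, was not yet exposed for vectors.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// The common alignment of several loads is the smallest alignment among them.
uint64_t commonAlignment(const std::vector<uint64_t> &loadAligns) {
  assert(!loadAligns.empty() && "need at least one load");
  return *std::min_element(loadAligns.begin(), loadAligns.end());
}

// The element size may be increased only if every original load is aligned
// for the widened element, unless unaligned vector accesses are known to be
// fast on the target.
bool canWidenElements(const std::vector<uint64_t> &loadAligns,
                      uint64_t wideEltBytes, bool fastUnalignedVector) {
  if (fastUnalignedVector)
    return true;
  return commonAlignment(loadAligns) >= wideEltBytes;
}
```

With four `align 1` loads and a 4-byte widened element the check fails, and with `align 4` on every load it passes.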

luke added inline comments. Apr 7 2023, 6:10 AM

llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-combine.ll
326–427	I'm struggling to think of ways to get an illegal result type. We can produce an illegal type when we concat an irregular number of vectors so we get something like `3 x v4i16 -> v12i16`, which is covered by `widen_3xv4i16`, but in that case we never actually do the combine in the first place since it doesn't have a concat_vector that we can match on. `strided_constant_v4i32` handles the case where we would have tried to do a strided load of `v2i128`. Did you have a specific example in mind?

Address review comments

luke marked 5 inline comments as done. Apr 7 2023, 6:17 AM

Harbormaster completed remote builds in B224210: Diff 511677. Apr 7 2023, 6:57 AM

luke added inline comments. Apr 7 2023, 7:11 AM

llvm/lib/Target/RISCV/RISCVISelLowering.cpp

11769–11771

Is it legal to increase the alignment here?
E.g. for these loads

%0 = load <4 x i8>, ptr %pix1, align 1
%add.ptr = getelementptr inbounds i8, ptr %pix1, i64 %idx.ext
%2 = load <4 x i8>, ptr %add.ptr, align 1

Can we use an align of 4 * 1:

%0 = call <2 x i32> @llvm.riscv.strided.load ptr %pix1, i64 %idx.ext, align 4

luke added inline comments. Apr 7 2023, 7:40 AM

llvm/lib/Target/RISCV/RISCVISelLowering.cpp

11769–11771

I have a feeling the answer is no, which would mean that we can't combine this in x264 SAD:

  #include <stdint.h>
  #include <stdlib.h>
  typedef uint8_t pixel;

  #define PIXEL_SAD_C( name, lx, ly )                 \
      int name( pixel *pix1, intptr_t i_stride_pix1,  \
                pixel *pix2, intptr_t i_stride_pix2 ) \
  {                                                   \
      int i_sum = 0;                                  \
      for( int y = 0; y < ly; y++ )                   \
      {                                               \
          for( int x = 0; x < lx; x++ )               \
          {                                           \
              i_sum += abs( pix1[x] - pix2[x] );      \
          }                                           \
          pix1 += i_stride_pix1;                      \
          pix2 += i_stride_pix2;                      \
      }                                               \
      return i_sum;                                   \
  }

  PIXEL_SAD_C(x264_pixel_sad_4x4, 4, 4)

There's no guarantee here that pix1/pix2/i_stride_pix1/i_stride_pix2 are word aligned so we can't use vlse32. Unless we know it has fast unaligned access?
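
The concern can be phrased as a runtime property of the row addresses. The sketch below is only an illustration with a hypothetical helper, `rowsAligned`. It treats addresses as plain integers and assumes `eltBytes` is a power of two. A `vlse32` over the rows would need every row start to be 4-byte aligned, which in general requires both the base pointer and the stride to be multiples of 4.

```cpp
#include <cstdint>

// True if every row start, base + r * stride for r in [0, rows), is a
// multiple of eltBytes. Unsigned wraparound keeps the arithmetic well defined
// for negative strides; the remainder is still correct for power-of-two
// eltBytes.
bool rowsAligned(uint64_t baseAddr, int64_t strideBytes, unsigned rows,
                 uint64_t eltBytes) {
  for (unsigned r = 0; r < rows; ++r) {
    uint64_t addr = baseAddr + static_cast<uint64_t>(strideBytes) * r;
    if (addr % eltBytes != 0)
      return false;
  }
  return true;
}
```

For example, a base of `0x1000` with a stride of 16 satisfies the check for 4-byte elements, while a base of `0x1001`, or a stride of 6, does not. The compiler cannot assume the aligned case when the IR loads only carry `align 1`.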

luke added inline comments. Apr 7 2023, 8:36 AM

llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-combine.ll
339–341	@reames FYI, this mirrors the loads from SLP in x264 SAD which have an alignment of 1

reames added inline comments. Apr 10 2023, 10:34 AM

llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-combine.ll
339–341	For the purpose of this patch, please update the tests to have the required alignment and add a negative test for the unaligned case. I think this case is generally useful, regardless of the outcome on the spec test. For x264 specifically, let's talk offline.

craig.topper added inline comments. Apr 10 2023, 3:03 PM

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11721	Do we need to call makeEquivalentMemoryOrdering on all the loads here?

Make equivalent memory ordering for all loads in vle case

luke marked an inline comment as done. Apr 11 2023, 1:57 AM

Harbormaster completed remote builds in B224719: Diff 512369. Apr 11 2023, 2:43 AM

craig.topper added inline comments. Apr 11 2023, 8:27 PM

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11660	Can we move this to a function? This is a lot of code to dump into the switch. We've been pretty sloppy about this.
11662	Drop MVT from this comment. It's redundant with "types".
11675	You probably want to check this isn't an extending load either. isNormalLoad should do it I think.
11689	We might want to find common alignment instead?
11717	This also needs allowsMemoryAccessForAlignment right?
11734	Wondering if we should always do integer in case f64 vector isn't supported, but i64 is?
11778	I'm wondering if the bitcast should come before the convertFromScalableVector. That keeps convertFromScalableVector/convertToScalableVector pairs together without having bitcasts between them.

Address review comments

luke marked 6 inline comments as done. Apr 13 2023, 4:38 AM

luke added inline comments.

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11734	Good point, I couldn't recreate a test case for this though because we always check that `VT` is legal first. I also tried removing the legal type check, but then we get an assertion when calling `convertFromScalableVector` with an illegal type.

luke marked 4 inline comments as done. Apr 13 2023, 4:39 AM

Harbormaster completed remote builds in B225309: Diff 513172. Apr 13 2023, 5:50 AM

craig.topper added inline comments. Apr 13 2023, 6:52 PM

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11734	The case I was thinking was something like a concat of 2 v2f32 vectors which we would widen to a v2f64 strided load?

craig.topper added inline comments. Apr 13 2023, 6:53 PM

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11734	On a Zve64f target that doesn't support Zve64d.
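
The Zve64f scenario can be modelled with a small capability check. This is a sketch under stated assumptions, not LLVM code: `VectorCaps`, `wideScalarBits`, `canUseIntegerStridedLoad` and `canUseFpStridedLoad` are hypothetical names, and the capability values encode only that Zve64f provides 64-bit integer elements but floating-point elements no wider than 32 bits.

```cpp
// Width in bits of the widened element when a whole small vector becomes one
// element of the strided load: v4i8 -> 32, v2f32 -> 64.
unsigned wideScalarBits(unsigned eltBits, unsigned numElts) {
  return eltBits * numElts;
}

// Hypothetical description of a target's widest supported vector elements.
struct VectorCaps {
  unsigned maxIntEltBits;
  unsigned maxFpEltBits;
};

// Widening through an integer element only needs integer support.
bool canUseIntegerStridedLoad(unsigned eltBits, unsigned numElts,
                              const VectorCaps &caps) {
  return wideScalarBits(eltBits, numElts) <= caps.maxIntEltBits;
}

// Widening through a floating-point element needs FP support of that width.
bool canUseFpStridedLoad(unsigned eltBits, unsigned numElts,
                         const VectorCaps &caps) {
  return wideScalarBits(eltBits, numElts) <= caps.maxFpEltBits;
}
```

With `VectorCaps{64, 32}`, a concat of `v2f32` loads can be widened through a 64-bit integer element but not through `f64`, which is the reason for always choosing an integer type for the strided load. The same check rejects a 128-bit widened element, matching the `v2i128` case mentioned earlier.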

Add RUN line for zve64f to test that i64 loads are used even if f64 isn't supported

luke marked 3 inline comments as done. Apr 18 2023, 6:17 AM

luke added inline comments.

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
11734	Ah that makes sense. Added a RUN line for it, should be covered by `@strided_runtime_4xv2f32`

luke marked an inline comment as done. Apr 18 2023, 6:18 AM

Harbormaster completed remote builds in B226369: Diff 514628. Apr 18 2023, 7:12 AM

LGTM

This revision is now accepted and ready to land. Apr 18 2023, 4:22 PM

Closed by commit rG18dc205112df: [RISCV] Combine concat_vectors of loads into strided loads (authored by luke). Apr 19 2023, 1:37 AM

This revision was automatically updated to reflect the committed changes.

luke added a commit: rG18dc205112df: [RISCV] Combine concat_vectors of loads into strided loads.

reames mentioned this in D149375: [RISCV] Introduce unaligned-vector-mem feature. Apr 27 2023, 12:22 PM

reames mentioned this in rGd636bcb6ae51: [RISCV] Introduce unaligned-vector-mem feature. Apr 28 2023, 8:28 AM

Revision Contents

Path                                                                 Size
llvm/lib/Target/RISCV/RISCVISelLowering.cpp                          137 lines
llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-combine.ll    138 lines

Diff 513172

llvm/lib/Target/RISCV/RISCVISelLowering.cpp

[1,116 lines hidden]
   if (Subtarget.hasStdExtZfhOrZfhmin())
     setTargetDAGCombine(ISD::SIGN_EXTEND_INREG);
   if (Subtarget.hasStdExtF())
     setTargetDAGCombine({ISD::ZERO_EXTEND, ISD::FP_TO_SINT, ISD::FP_TO_UINT,
                          ISD::FP_TO_SINT_SAT, ISD::FP_TO_UINT_SAT});
   if (Subtarget.hasVInstructions())
     setTargetDAGCombine({ISD::FCOPYSIGN, ISD::MGATHER, ISD::MSCATTER,
                          ISD::VP_GATHER, ISD::VP_SCATTER, ISD::SRA, ISD::SRL,
-                         ISD::SHL, ISD::STORE, ISD::SPLAT_VECTOR});
+                         ISD::SHL, ISD::STORE, ISD::SPLAT_VECTOR,
+                         ISD::CONCAT_VECTORS});
   if (Subtarget.hasVendorXTHeadMemPair())
     setTargetDAGCombine({ISD::LOAD, ISD::STORE});
   if (Subtarget.useRVVForFixedLengthVectors())
     setTargetDAGCombine(ISD::BITCAST);

   setLibcallName(RTLIB::FPEXT_F16_F32, "__extendhfsf2");
   setLibcallName(RTLIB::FPROUND_F32_F16, "__truncsfhf2");
 }
[9,879 lines hidden] static SDValue performSELECTCombine(SDNode *N, SelectionDAG &DAG,

   SDValue TrueVal = N->getOperand(1);
   SDValue FalseVal = N->getOperand(2);
   if (SDValue V = tryFoldSelectIntoOp(N, DAG, TrueVal, FalseVal, /*Swapped*/false))
     return V;
   return tryFoldSelectIntoOp(N, DAG, FalseVal, TrueVal, /*Swapped*/true);
 }

+// If we're concatenating a series of vector loads like
+// concat_vectors (load v4i8, p+0), (load v4i8, p+n), (load v4i8, p+n*2) ...
+// Then we can turn this into a strided load by widening the vector elements
+// vlse32 p, stride=n
+static SDValue performCONCAT_VECTORSCombine(SDNode *N, SelectionDAG &DAG,
+                                            const RISCVSubtarget &Subtarget,
+                                            const RISCVTargetLowering &TLI) {
+  SDLoc DL(N);
+  EVT VT = N->getValueType(0);
+
+  // Only perform this combine on legal MVTs.
+  if (!TLI.isTypeLegal(VT))
+    return SDValue();
+
+  // TODO: Potentially extend this to scalable vectors
+  if (VT.isScalableVector())
+    return SDValue();
+
+  auto *BaseLd = dyn_cast<LoadSDNode>(N->getOperand(0));
+  if (!BaseLd || !BaseLd->isSimple() || !ISD::isNormalLoad(BaseLd) ||
+      !SDValue(BaseLd, 0).hasOneUse())
+    return SDValue();
+
+  EVT BaseLdVT = BaseLd->getValueType(0);
+  SDValue BasePtr = BaseLd->getBasePtr();
+
+  // Go through the loads and check that they're strided
+  SDValue CurPtr = BasePtr;
+  SDValue Stride;
+  Align Align = BaseLd->getAlign();
+
+  for (SDValue Op : N->ops().drop_front()) {
+    auto *Ld = dyn_cast<LoadSDNode>(Op);
+    if (!Ld || !Ld->isSimple() || !Op.hasOneUse() ||
+        Ld->getChain() != BaseLd->getChain() || !ISD::isNormalLoad(Ld) ||
+        Ld->getValueType(0) != BaseLdVT)
+      return SDValue();
+
+    SDValue Ptr = Ld->getBasePtr();
+    // Check that each load's pointer is (add CurPtr, Stride)
+    if (Ptr.getOpcode() != ISD::ADD || Ptr.getOperand(0) != CurPtr)
+      return SDValue();
+    SDValue Offset = Ptr.getOperand(1);
+    if (!Stride)
+      Stride = Offset;
+    else if (Offset != Stride)
+      return SDValue();
+
+    // The common alignment is the most restrictive (smallest) of all the loads
+    Align = std::min(Align, Ld->getAlign());
+
+    CurPtr = Ptr;
+  }
+
+  // A special case is if the stride is exactly the width of one of the loads,
+  // in which case it's contiguous and can be combined into a regular vle
+  // without changing the element size
+  if (auto *ConstStride = dyn_cast<ConstantSDNode>(Stride);
+      ConstStride &&
+      ConstStride->getZExtValue() == BaseLdVT.getFixedSizeInBits() / 8) {
+    MachineMemOperand *MMO = DAG.getMachineFunction().getMachineMemOperand(
+        BaseLd->getPointerInfo(), BaseLd->getMemOperand()->getFlags(),
+        VT.getStoreSize(), Align);
+    // Can't do the combine if the load isn't naturally aligned with the element
+    // type
+    if (!TLI.allowsMemoryAccessForAlignment(*DAG.getContext(),
+                                            DAG.getDataLayout(), VT, *MMO))
+      return SDValue();
+
+    SDValue WideLoad = DAG.getLoad(VT, DL, BaseLd->getChain(), BasePtr, MMO);
+    for (SDValue Ld : N->ops())
+      DAG.makeEquivalentMemoryOrdering(cast<LoadSDNode>(Ld), WideLoad);
+    return WideLoad;
+  }
+
+  // Get the widened scalar type, e.g. v4i8 -> i64
+  unsigned WideScalarBitWidth =
+      BaseLdVT.getScalarSizeInBits() * BaseLdVT.getVectorNumElements();
+  MVT WideScalarVT = MVT::getIntegerVT(WideScalarBitWidth);
+
+  // Get the vector type for the strided load, e.g. 4 x v4i8 -> v4i64
+  MVT WideVecVT = MVT::getVectorVT(WideScalarVT, N->getNumOperands());
+  if (!TLI.isTypeLegal(WideVecVT))
+    return SDValue();
+
+  MVT ContainerVT = TLI.getContainerForFixedLengthVector(WideVecVT);
+  SDValue VL =
+      getDefaultVLOps(WideVecVT, ContainerVT, DL, DAG, Subtarget).second;
+  SDVTList VTs = DAG.getVTList({ContainerVT, MVT::Other});
+  SDValue IntID =
+      DAG.getTargetConstant(Intrinsic::riscv_vlse, DL, Subtarget.getXLenVT());
+  SDValue Ops[] = {BaseLd->getChain(),
+                   IntID,
+                   DAG.getUNDEF(ContainerVT),
+                   BasePtr,
+                   Stride,
+                   VL};
+
+  uint64_t MemSize;
+  if (auto *ConstStride = dyn_cast<ConstantSDNode>(Stride))
+    // total size = (elsize * n) + (stride - elsize) * (n-1)
+    //            = elsize + stride * (n-1)
+    MemSize = WideScalarVT.getSizeInBits() +
+              ConstStride->getSExtValue() * (N->getNumOperands() - 1);
+  else
+    // If Stride isn't constant, then we can't know how much it will load
+    MemSize = MemoryLocation::UnknownSize;
+
+  MachineMemOperand *MMO = DAG.getMachineFunction().getMachineMemOperand(
+      BaseLd->getPointerInfo(), BaseLd->getMemOperand()->getFlags(), MemSize,
+      Align);
+
+  // Can't do the combine if the common alignment isn't naturally aligned with
+  // the new element type
+  if (!TLI.allowsMemoryAccessForAlignment(*DAG.getContext(),
+                                          DAG.getDataLayout(), WideVecVT, *MMO))
+    return SDValue();
+
+  SDValue StridedLoad = DAG.getMemIntrinsicNode(ISD::INTRINSIC_W_CHAIN, DL, VTs,
+                                                Ops, WideVecVT, MMO);
+  for (SDValue Ld : N->ops())
+    DAG.makeEquivalentMemoryOrdering(cast<LoadSDNode>(Ld), StridedLoad);
+
+  // Note: Perform the bitcast before the convertFromScalableVector so we have
+  // balanced pairs of convertFromScalable/convertToScalable
+  SDValue Res = DAG.getBitcast(
+      TLI.getContainerForFixedLengthVector(VT.getSimpleVT()), StridedLoad);
+  return convertFromScalableVector(VT, Res, DAG, Subtarget);
+}

 SDValue RISCVTargetLowering::PerformDAGCombine(SDNode *N,
                                                DAGCombinerInfo &DCI) const {
   SelectionDAG &DAG = DCI.DAG;

   // Helper to call SimplifyDemandedBits on an operand of N where only some low
   // bits are demanded. N will be added to the Worklist if it was not deleted.
   // Caller should return SDValue(N, 0) if this returns true.
   auto SimplifyDemandedLowBitsHelper = [&](unsigned OpNo, unsigned LowBits) {
[491 lines hidden] case ISD::SPLAT_VECTOR: {
     // Only perform this combine on legal MVT types.
     if (!isTypeLegal(VT))
       break;
     if (auto Gather = matchSplatAsGather(N->getOperand(0), VT.getSimpleVT(), N,
                                          DAG, Subtarget))
       return Gather;
     break;
   }
+  case ISD::CONCAT_VECTORS:
+    if (SDValue V = performCONCAT_VECTORSCombine(N, DAG, Subtarget, *this))
+      return V;
+    break;
case RISCVISD::VMV_V_X_VL: {		case RISCVISD::VMV_V_X_VL: {
// Tail agnostic VMV.V.X only demands the vector element bitwidth from the		// Tail agnostic VMV.V.X only demands the vector element bitwidth from the
// scalar input.		// scalar input.
unsigned ScalarSize = N->getOperand(1).getValueSizeInBits();		unsigned ScalarSize = N->getOperand(1).getValueSizeInBits();
unsigned EltWidth = N->getValueType(0).getScalarSizeInBits();		unsigned EltWidth = N->getValueType(0).getScalarSizeInBits();
if (ScalarSize > EltWidth && N->getOperand(0).isUndef())		if (ScalarSize > EltWidth && N->getOperand(0).isUndef())
if (SimplifyDemandedLowBitsHelper(1, EltWidth))		if (SimplifyDemandedLowBitsHelper(1, EltWidth))
return SDValue(N, 0);		return SDValue(N, 0);

break;		break;
}		}
case RISCVISD::VFMV_S_F_VL: {		case RISCVISD::VFMV_S_F_VL: {
SDValue Src = N->getOperand(1);		SDValue Src = N->getOperand(1);
		craig.topperUnsubmitted Done Reply Inline Actions You probably want to check this isn't an extending load either. isNormalLoad should do it I think. craig.topper: You probably want to check this isn't an extending load either. isNormalLoad should do it I…
// Try to remove vector->scalar->vector if the scalar->vector is inserting		// Try to remove vector->scalar->vector if the scalar->vector is inserting
// into an undef vector.		// into an undef vector.
// TODO: Could use a vslide or vmv.v.v for non-undef.		// TODO: Could use a vslide or vmv.v.v for non-undef.
if (N->getOperand(0).isUndef() &&		if (N->getOperand(0).isUndef() &&
Src.getOpcode() == ISD::EXTRACT_VECTOR_ELT &&		Src.getOpcode() == ISD::EXTRACT_VECTOR_ELT &&
isNullConstant(Src.getOperand(1)) &&		isNullConstant(Src.getOperand(1)) &&
Src.getOperand(0).getValueType().isScalableVector()) {		Src.getOperand(0).getValueType().isScalableVector()) {
EVT VT = N->getValueType(0);		EVT VT = N->getValueType(0);
EVT SrcVT = Src.getOperand(0).getValueType();		EVT SrcVT = Src.getOperand(0).getValueType();
assert(SrcVT.getVectorElementType() == VT.getVectorElementType());		assert(SrcVT.getVectorElementType() == VT.getVectorElementType());
		craig.topperUnsubmitted Done Reply Inline Actions We need all the input loads to have the same chain input craig.topper: We need all the input loads to have the same chain input
// Widths match, just return the original vector.		// Widths match, just return the original vector.
if (SrcVT == VT)		if (SrcVT == VT)
return Src.getOperand(0);		return Src.getOperand(0);
// TODO: Use insert_subvector/extract_subvector to change widen/narrow?		// TODO: Use insert_subvector/extract_subvector to change widen/narrow?
		lukeAuthorUnsubmitted Done Reply Inline Actions This currently only works for strides that use an incremental pattern, e.g. `p, (+ p stride), (+ (+ p stride) stride), ...` A strided load could also be represented with a pointer vector built by stepvector + multiply by stride luke: This currently only works for strides that use an incremental pattern, e.g. `p, (+ p stride)…
		reamesUnsubmitted Done Reply Inline Actions Do you have an example here? I'd expect the load pointer operand to be scalar and thus the form you describe would likely have been scalarized. reames: Do you have an example here? I'd expect the load pointer operand to be scalar and thus the…
		lukeAuthorUnsubmitted Done Reply Inline Actions Whoops, not a stepvector, what I meant to say was: %p1offset = mul i64 %stride, 1 %p1 = getelementptr i8, ptr %p, i64 %p1offset %p2offset = mul i64 %stride, 2 %p2 = getelementptr i8, ptr %p, i64 %p2offset %p3offset = mul i64 %stride, 3 %p3 = getelementptr i8, ptr %p, i64 %p3offset With that said I haven't seen this being emitted so far, but I'll keep an eye out to see if this or other patterns show up in the wild. luke: Whoops, not a stepvector, what I meant to say was: ``` %p1offset = mul i64 %stride, 1 %p1 =…
		craig.topperUnsubmitted Done Reply Inline Actions We might want to find common alignment instead? craig.topper: We might want to find common alignment instead?
}		}
break;		break;
}		}
case ISD::INTRINSIC_WO_CHAIN: {		case ISD::INTRINSIC_WO_CHAIN: {
unsigned IntNo = N->getConstantOperandVal(0);		unsigned IntNo = N->getConstantOperandVal(0);
switch (IntNo) {		switch (IntNo) {
// By default we do not combine any intrinsic.		// By default we do not combine any intrinsic.
		lukeAuthorUnsubmitted Done Reply Inline Actions Do we need to check the memory VT as well? luke: Do we need to check the memory VT as well?
default:		default:
return SDValue();		return SDValue();
case Intrinsic::riscv_vcpop:		case Intrinsic::riscv_vcpop:
case Intrinsic::riscv_vcpop_mask:		case Intrinsic::riscv_vcpop_mask:
case Intrinsic::riscv_vfirst:		case Intrinsic::riscv_vfirst:
case Intrinsic::riscv_vfirst_mask: {		case Intrinsic::riscv_vfirst_mask: {
SDValue VL = N->getOperand(2);		SDValue VL = N->getOperand(2);
if (IntNo == Intrinsic::riscv_vcpop_mask \|\|		if (IntNo == Intrinsic::riscv_vcpop_mask \|\|
IntNo == Intrinsic::riscv_vfirst_mask)		IntNo == Intrinsic::riscv_vfirst_mask)
VL = N->getOperand(3);		VL = N->getOperand(3);
if (!isNullConstant(VL))		if (!isNullConstant(VL))
return SDValue();		return SDValue();
// If VL is 0, vcpop -> li 0, vfirst -> li -1.		// If VL is 0, vcpop -> li 0, vfirst -> li -1.
SDLoc DL(N);		SDLoc DL(N);
EVT VT = N->getValueType(0);		EVT VT = N->getValueType(0);
if (IntNo == Intrinsic::riscv_vfirst \|\|		if (IntNo == Intrinsic::riscv_vfirst \|\|
IntNo == Intrinsic::riscv_vfirst_mask)		IntNo == Intrinsic::riscv_vfirst_mask)
return DAG.getConstant(-1, DL, VT);		return DAG.getConstant(-1, DL, VT);
return DAG.getConstant(0, DL, VT);		return DAG.getConstant(0, DL, VT);
}		}
}		}
		craig.topperUnsubmitted Done Reply Inline Actions This also needs allowsMemoryAccessForAlignment right? craig.topper: This also needs allowsMemoryAccessForAlignment right?
}		}
case ISD::BITCAST: {		case ISD::BITCAST: {
		lukeAuthorUnsubmitted Done Reply Inline Actions This bit could be target agnostic. I have a copy of a patch locally that puts this in DAGCombiner if that would be a better place luke: This bit could be target agnostic. I have a copy of a patch locally that puts this in…
		reamesUnsubmitted Done Reply Inline Actions Sounds like a good follow up to me. reames: Sounds like a good follow up to me.
assert(Subtarget.useRVVForFixedLengthVectors());		assert(Subtarget.useRVVForFixedLengthVectors());
SDValue N0 = N->getOperand(0);		SDValue N0 = N->getOperand(0);
		craig.topperUnsubmitted Done Reply Inline Actions Do we need to call makeEquivalentMemoryOrdering on all the loads here? craig.topper: Do we need to call makeEquivalentMemoryOrdering on all the loads here?
EVT VT = N->getValueType(0);		EVT VT = N->getValueType(0);
EVT SrcVT = N0.getValueType();		EVT SrcVT = N0.getValueType();
// If this is a bitcast between a MVT::v4i1/v2i1/v1i1 and an illegal integer		// If this is a bitcast between a MVT::v4i1/v2i1/v1i1 and an illegal integer
// type, widen both sides to avoid a trip through memory.		// type, widen both sides to avoid a trip through memory.
if ((SrcVT == MVT::v1i1 \|\| SrcVT == MVT::v2i1 \|\| SrcVT == MVT::v4i1) &&		if ((SrcVT == MVT::v1i1 \|\| SrcVT == MVT::v2i1 \|\| SrcVT == MVT::v4i1) &&
VT.isScalarInteger()) {		VT.isScalarInteger()) {
unsigned NumConcats = 8 / SrcVT.getVectorNumElements();		unsigned NumConcats = 8 / SrcVT.getVectorNumElements();
SmallVector<SDValue, 4> Ops(NumConcats, DAG.getUNDEF(SrcVT));		SmallVector<SDValue, 4> Ops(NumConcats, DAG.getUNDEF(SrcVT));
Ops[0] = N0;		Ops[0] = N0;
SDLoc DL(N);		SDLoc DL(N);
N0 = DAG.getNode(ISD::CONCAT_VECTORS, DL, MVT::v8i1, Ops);		N0 = DAG.getNode(ISD::CONCAT_VECTORS, DL, MVT::v8i1, Ops);
N0 = DAG.getBitcast(MVT::i8, N0);		N0 = DAG.getBitcast(MVT::i8, N0);
return DAG.getNode(ISD::TRUNCATE, DL, VT, N0);		return DAG.getNode(ISD::TRUNCATE, DL, VT, N0);
		craig.topperUnsubmitted Done Reply Inline Actions Wondering if we should always do integer in case f64 vector isn't supported, but i64 is? craig.topper: Wondering if we should always do integer in case f64 vector isn't supported, but i64 is?
		lukeAuthorUnsubmitted Done Reply Inline Actions Good point, I couldn't recreate a test case for this though because we always check that `VT` is legal first. I also tried removing the legal type check, but then we get an assertion when calling `convertFromScalableVector` with an illegal type. luke: Good point, I couldn't recreate a test case for this though because we always check that `VT`…
		craig.topperUnsubmitted Done Reply Inline Actions The case I was thinking was something like a concat of 2 v2f32 vectors which we would widen to a v2f64 strided load? craig.topper: The case I was thinking was something like a concat of 2 v2f32 vectors which we would widen to…
		craig.topperUnsubmitted Done Reply Inline Actions On a Zve64f target that doesn't support Zve64d. craig.topper: On a Zve64f target that doesn't support Zve64d.
		lukeAuthorUnsubmitted Done Reply Inline Actions Ah that makes sense. Added a RUN line for it, should be covered by `@strided_runtime_4xv2f32` luke: Ah that makes sense. Added a RUN line for it, should be covered by `@strided_runtime_4xv2f32`
}		}

return SDValue();		return SDValue();
}		}
}		}

return SDValue();		return SDValue();
}		}

bool RISCVTargetLowering::isDesirableToCommuteWithShift(		bool RISCVTargetLowering::isDesirableToCommuteWithShift(
const SDNode *N, CombineLevel Level) const {		const SDNode *N, CombineLevel Level) const {
assert((N->getOpcode() == ISD::SHL \|\| N->getOpcode() == ISD::SRA \|\|		assert((N->getOpcode() == ISD::SHL \|\| N->getOpcode() == ISD::SRA \|\|
N->getOpcode() == ISD::SRL) &&		N->getOpcode() == ISD::SRL) &&
"Expected shift op");		"Expected shift op");

// The following folds are only desirable if `(OP _, c1 << c2)` can be		// The following folds are only desirable if `(OP _, c1 << c2)` can be
// materialised in fewer instructions than `(OP _, c1)`:		// materialised in fewer instructions than `(OP _, c1)`:
//		//
// (shl (add x, c1), c2) -> (add (shl x, c2), c1 << c2)		// (shl (add x, c1), c2) -> (add (shl x, c2), c1 << c2)
		reamesUnsubmitted Done Reply Inline Actions The size on this memory operand is wrong. The memory region accessed isn't the size of the result vector, it's the entire stride region which can be much larger. You can probably just use an unknown size here - unless we already have the utility code for this somewhere else. reames: The size on this memory operand is wrong. The memory region accessed isn't the size of the…
// (shl (or x, c1), c2) -> (or (shl x, c2), c1 << c2)		// (shl (or x, c1), c2) -> (or (shl x, c2), c1 << c2)
		craig.topper: We have to call makeEquivalentMemoryOrdering for all of the loads that were combined.
SDValue N0 = N->getOperand(0);		SDValue N0 = N->getOperand(0);
EVT Ty = N0.getValueType();		EVT Ty = N0.getValueType();
if (Ty.isScalarInteger() &&		if (Ty.isScalarInteger() &&
		craig.topper: No need for break after return.
(N0.getOpcode() == ISD::ADD || N0.getOpcode() == ISD::OR)) {		(N0.getOpcode() == ISD::ADD || N0.getOpcode() == ISD::OR)) {
auto *C1 = dyn_cast<ConstantSDNode>(N0->getOperand(1));		auto *C1 = dyn_cast<ConstantSDNode>(N0->getOperand(1));
auto *C2 = dyn_cast<ConstantSDNode>(N->getOperand(1));		auto *C2 = dyn_cast<ConstantSDNode>(N->getOperand(1));
if (C1 && C2) {		if (C1 && C2) {
const APInt &C1Int = C1->getAPIntValue();		const APInt &C1Int = C1->getAPIntValue();
APInt ShiftedC1Int = C1Int << C2->getAPIntValue();		APInt ShiftedC1Int = C1Int << C2->getAPIntValue();

// We can materialise `c1 << c2` into an add immediate, so it's "free",		// We can materialise `c1 << c2` into an add immediate, so it's "free",
// and the combine should happen, to potentially allow further combines		// and the combine should happen, to potentially allow further combines
// later.		// later.
if (ShiftedC1Int.getSignificantBits() <= 64 &&		if (ShiftedC1Int.getSignificantBits() <= 64 &&
isLegalAddImmediate(ShiftedC1Int.getSExtValue()))		isLegalAddImmediate(ShiftedC1Int.getSExtValue()))
return true;		return true;

		luke: Is it legal to increase the alignment here? E.g. for these loads:
		    %0 = load <4 x i8>, ptr %pix1, align 1
		    %add.ptr = getelementptr inbounds i8, ptr %pix1, i64 %idx.ext
		    %2 = load <4 x i8>, ptr %add.ptr, align 1
		can we use an align of 4 * 1:
		    %0 = call <2 x i32> @llvm.riscv.strided.load ptr %pix1, i64 %idx.ext, align 4
		luke: I have a feeling the answer is no, which would mean that we can't combine this in x264 SAD:
		    #include <stdint.h>
		    #include <stdlib.h>
		    typedef uint8_t pixel;
		    #define PIXEL_SAD_C( name, lx, ly ) \
		    int name( pixel *pix1, intptr_t i_stride_pix1, \
		              pixel *pix2, intptr_t i_stride_pix2 ) \
		    { \
		        int i_sum = 0; \
		        for( int y = 0; y < ly; y++ ) \
		        { \
		            for( int x = 0; x < lx; x++ ) \
		            { \
		                i_sum += abs( pix1[x] - pix2[x] ); \
		            } \
		            pix1 += i_stride_pix1; \
		            pix2 += i_stride_pix2; \
		        } \
		        return i_sum; \
		    }
		    PIXEL_SAD_C(x264_pixel_sad_4x4, 4, 4)
		There's no guarantee here that `pix1`/`pix2`/`i_stride_pix1`/`i_stride_pix2` are word aligned, so we can't use vlse32. Unless we know it has fast unaligned access?
// We can materialise `c1` in an add immediate, so it's "free", and the		// We can materialise `c1` in an add immediate, so it's "free", and the
// combine should be prevented.		// combine should be prevented.
if (C1Int.getSignificantBits() <= 64 &&		if (C1Int.getSignificantBits() <= 64 &&
isLegalAddImmediate(C1Int.getSExtValue()))		isLegalAddImmediate(C1Int.getSExtValue()))
return false;		return false;

// Neither constant will fit into an immediate, so find materialisation		// Neither constant will fit into an immediate, so find materialisation
		craig.topper: I'm wondering if the bitcast should come before the convertFromScalableVector. That keeps convertFromScalableVector/convertToScalableVector pairs together without having bitcasts between them.
// costs.		// costs.
int C1Cost = RISCVMatInt::getIntMatCost(C1Int, Ty.getSizeInBits(),		int C1Cost = RISCVMatInt::getIntMatCost(C1Int, Ty.getSizeInBits(),
Subtarget.getFeatureBits(),		Subtarget.getFeatureBits(),
/*CompressionCost*/true);		/*CompressionCost*/true);
int ShiftedC1Cost = RISCVMatInt::getIntMatCost(		int ShiftedC1Cost = RISCVMatInt::getIntMatCost(
ShiftedC1Int, Ty.getSizeInBits(), Subtarget.getFeatureBits(),		ShiftedC1Int, Ty.getSizeInBits(), Subtarget.getFeatureBits(),
/*CompressionCost*/true);		/*CompressionCost*/true);

(3,824 lines not shown)

llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-combine.ll

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 2		; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 2
; RUN: llc -mtriple=riscv32 -mattr=+v,+zfh,+experimental-zvfh -verify-machineinstrs < %s | FileCheck %s -check-prefixes=CHECK,RV32		; RUN: llc -mtriple=riscv32 -mattr=+v,+zfh,+experimental-zvfh -verify-machineinstrs < %s | FileCheck %s -check-prefixes=CHECK,RV32
; RUN: llc -mtriple=riscv64 -mattr=+v,+zfh,+experimental-zvfh -verify-machineinstrs < %s | FileCheck %s -check-prefixes=CHECK,RV64		; RUN: llc -mtriple=riscv64 -mattr=+v,+zfh,+experimental-zvfh -verify-machineinstrs < %s | FileCheck %s -check-prefixes=CHECK,RV64

; The two loads are contiguous and should be folded into one		; The two loads are contiguous and should be folded into one
define void @widen_2xv4i16(ptr %x, ptr %z) {		define void @widen_2xv4i16(ptr %x, ptr %z) {
; CHECK-LABEL: widen_2xv4i16:		; CHECK-LABEL: widen_2xv4i16:
; CHECK: # %bb.0:		; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma
; CHECK-NEXT: vle16.v v8, (a0)
; CHECK-NEXT: addi a0, a0, 8
; CHECK-NEXT: vle16.v v9, (a0)
; CHECK-NEXT: vsetivli zero, 8, e16, m1, ta, ma		; CHECK-NEXT: vsetivli zero, 8, e16, m1, ta, ma
; CHECK-NEXT: vslideup.vi v8, v9, 4		; CHECK-NEXT: vle16.v v8, (a0)
; CHECK-NEXT: vse16.v v8, (a1)		; CHECK-NEXT: vse16.v v8, (a1)
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%a = load <4 x i16>, ptr %x		%a = load <4 x i16>, ptr %x
%b.gep = getelementptr i8, ptr %x, i64 8		%b.gep = getelementptr i8, ptr %x, i64 8
%b = load <4 x i16>, ptr %b.gep		%b = load <4 x i16>, ptr %b.gep
%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
store <8 x i16> %c, ptr %z		store <8 x i16> %c, ptr %z
ret void		ret void
(30 lines not shown)
; RV64-NEXT: vsetivli zero, 12, e16, m2, tu, ma		; RV64-NEXT: vsetivli zero, 12, e16, m2, tu, ma
; RV64-NEXT: vslideup.vi v8, v12, 8		; RV64-NEXT: vslideup.vi v8, v12, 8
; RV64-NEXT: vsetivli zero, 1, e64, m2, ta, ma		; RV64-NEXT: vsetivli zero, 1, e64, m2, ta, ma
; RV64-NEXT: vslidedown.vi v10, v8, 2		; RV64-NEXT: vslidedown.vi v10, v8, 2
; RV64-NEXT: addi a0, a1, 16		; RV64-NEXT: addi a0, a1, 16
; RV64-NEXT: vse64.v v10, (a0)		; RV64-NEXT: vse64.v v10, (a0)
; RV64-NEXT: vsetivli zero, 8, e16, m1, ta, ma		; RV64-NEXT: vsetivli zero, 8, e16, m1, ta, ma
; RV64-NEXT: vse16.v v8, (a1)		; RV64-NEXT: vse16.v v8, (a1)
; RV64-NEXT: ret		; RV64-NEXT: ret
		luke: This test case doesn't produce a `concat_vector v0, v1, v2`; it's some other pattern that doesn't get picked up by the combine. Should fix it at some point.
%a = load <4 x i16>, ptr %x		%a = load <4 x i16>, ptr %x
%b.gep = getelementptr i8, ptr %x, i64 8		%b.gep = getelementptr i8, ptr %x, i64 8
%b = load <4 x i16>, ptr %b.gep		%b = load <4 x i16>, ptr %b.gep
%c.gep = getelementptr i8, ptr %b.gep, i64 8		%c.gep = getelementptr i8, ptr %b.gep, i64 8
%c = load <4 x i16>, ptr %c.gep		%c = load <4 x i16>, ptr %c.gep
%d.0 = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%d.0 = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%d.1 = shufflevector <4 x i16> %c, <4 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison>		%d.1 = shufflevector <4 x i16> %c, <4 x i16> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison>
%d.2 = shufflevector <8 x i16> %d.0, <8 x i16> %d.1, <12 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11>		%d.2 = shufflevector <8 x i16> %d.0, <8 x i16> %d.1, <12 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11>
store <12 x i16> %d.2, ptr %z		store <12 x i16> %d.2, ptr %z
ret void		ret void
}		}

define void @widen_4xv4i16(ptr %x, ptr %z) {		define void @widen_4xv4i16(ptr %x, ptr %z) {
; CHECK-LABEL: widen_4xv4i16:		; CHECK-LABEL: widen_4xv4i16:
; CHECK: # %bb.0:		; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma
; CHECK-NEXT: vle16.v v8, (a0)
; CHECK-NEXT: addi a2, a0, 8
; CHECK-NEXT: vle16.v v10, (a2)
; CHECK-NEXT: addi a2, a0, 16
; CHECK-NEXT: vle16.v v12, (a2)
; CHECK-NEXT: addi a0, a0, 24
; CHECK-NEXT: vle16.v v14, (a0)
; CHECK-NEXT: vsetivli zero, 8, e16, m2, tu, ma
; CHECK-NEXT: vslideup.vi v8, v10, 4
; CHECK-NEXT: vsetivli zero, 12, e16, m2, tu, ma
; CHECK-NEXT: vslideup.vi v8, v12, 8
; CHECK-NEXT: vsetivli zero, 16, e16, m2, ta, ma		; CHECK-NEXT: vsetivli zero, 16, e16, m2, ta, ma
; CHECK-NEXT: vslideup.vi v8, v14, 12		; CHECK-NEXT: vle16.v v8, (a0)
; CHECK-NEXT: vse16.v v8, (a1)		; CHECK-NEXT: vse16.v v8, (a1)
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%a = load <4 x i16>, ptr %x		%a = load <4 x i16>, ptr %x
%b.gep = getelementptr i8, ptr %x, i64 8		%b.gep = getelementptr i8, ptr %x, i64 8
%b = load <4 x i16>, ptr %b.gep		%b = load <4 x i16>, ptr %b.gep
%c.gep = getelementptr i8, ptr %b.gep, i64 8		%c.gep = getelementptr i8, ptr %b.gep, i64 8
%c = load <4 x i16>, ptr %c.gep		%c = load <4 x i16>, ptr %c.gep
%d.gep = getelementptr i8, ptr %c.gep, i64 8		%d.gep = getelementptr i8, ptr %c.gep, i64 8
%d = load <4 x i16>, ptr %d.gep		%d = load <4 x i16>, ptr %d.gep
%e.0 = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%e.0 = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%e.1 = shufflevector <4 x i16> %c, <4 x i16> %d, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%e.1 = shufflevector <4 x i16> %c, <4 x i16> %d, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%e.2 = shufflevector <8 x i16> %e.0, <8 x i16> %e.1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>		%e.2 = shufflevector <8 x i16> %e.0, <8 x i16> %e.1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
store <16 x i16> %e.2, ptr %z		store <16 x i16> %e.2, ptr %z
ret void		ret void
}		}

; Should be a strided load - with type coercion to i64		; Should be a strided load - with type coercion to i64
define void @strided_constant(ptr %x, ptr %z) {		define void @strided_constant(ptr %x, ptr %z) {
; CHECK-LABEL: strided_constant:		; CHECK-LABEL: strided_constant:
; CHECK: # %bb.0:		; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma		; CHECK-NEXT: li a2, 16
; CHECK-NEXT: vle16.v v8, (a0)		; CHECK-NEXT: vsetivli zero, 2, e64, m1, ta, ma
; CHECK-NEXT: addi a0, a0, 16		; CHECK-NEXT: vlse64.v v8, (a0), a2
; CHECK-NEXT: vle16.v v9, (a0)		; CHECK-NEXT: vse64.v v8, (a1)
; CHECK-NEXT: vsetivli zero, 8, e16, m1, ta, ma
; CHECK-NEXT: vslideup.vi v8, v9, 4
; CHECK-NEXT: vse16.v v8, (a1)
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%a = load <4 x i16>, ptr %x		%a = load <4 x i16>, ptr %x
%b.gep = getelementptr i8, ptr %x, i64 16		%b.gep = getelementptr i8, ptr %x, i64 16
%b = load <4 x i16>, ptr %b.gep		%b = load <4 x i16>, ptr %b.gep
%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
store <8 x i16> %c, ptr %z		store <8 x i16> %c, ptr %z
ret void		ret void
}		}
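
As a rough illustration of why the `strided_constant` output above is equivalent to the old sequence, the following C++ sketch models both forms on a plain byte buffer. It is not code from the patch: `concatOfLoads`, `stridedLoad64` and `sameBytes` are hypothetical helper names, and the sketch assumes a host where copying raw bytes is an adequate model of a vector load.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical model of: concat_vectors of `numLoads` <4 x i16> loads taken
// at p, p + stride, p + 2*stride, ...
std::vector<uint16_t> concatOfLoads(const uint8_t *p, std::size_t stride,
                                    std::size_t numLoads) {
  std::vector<uint16_t> out(numLoads * 4);
  for (std::size_t i = 0; i < numLoads; ++i)
    std::memcpy(out.data() + i * 4, p + i * stride, 4 * sizeof(uint16_t));
  return out;
}

// Hypothetical model of a strided load of `vl` 64-bit elements, i.e. what a
// vlse64.v with the same base pointer and byte stride would read.
std::vector<uint64_t> stridedLoad64(const uint8_t *p, std::size_t stride,
                                    std::size_t vl) {
  std::vector<uint64_t> out(vl);
  for (std::size_t i = 0; i < vl; ++i)
    std::memcpy(out.data() + i, p + i * stride, sizeof(uint64_t));
  return out;
}

// Compares two vectors byte for byte, ignoring their element types.
template <typename A, typename B>
bool sameBytes(const std::vector<A> &a, const std::vector<B> &b) {
  return a.size() * sizeof(A) == b.size() * sizeof(B) &&
         std::memcmp(a.data(), b.data(), a.size() * sizeof(A)) == 0;
}
```

With a stride of 16 and two loads (the shape of `strided_constant`), both helpers read bytes 0..7 and 16..23 of the buffer, so the results match byte for byte. Only the element type differs, which is why the new output stores with `vse64.v`.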

; Should be a strided load		; Should be a strided load
define void @strided_constant_64(ptr %x, ptr %z) {		define void @strided_constant_64(ptr %x, ptr %z) {
; CHECK-LABEL: strided_constant_64:		; CHECK-LABEL: strided_constant_64:
; CHECK: # %bb.0:		; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma		; CHECK-NEXT: li a2, 64
; CHECK-NEXT: vle16.v v8, (a0)		; CHECK-NEXT: vsetivli zero, 2, e64, m1, ta, ma
; CHECK-NEXT: addi a0, a0, 64		; CHECK-NEXT: vlse64.v v8, (a0), a2
; CHECK-NEXT: vle16.v v9, (a0)		; CHECK-NEXT: vse64.v v8, (a1)
; CHECK-NEXT: vsetivli zero, 8, e16, m1, ta, ma
; CHECK-NEXT: vslideup.vi v8, v9, 4
; CHECK-NEXT: vse16.v v8, (a1)
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%a = load <4 x i16>, ptr %x		%a = load <4 x i16>, ptr %x
%b.gep = getelementptr i8, ptr %x, i64 64		%b.gep = getelementptr i8, ptr %x, i64 64
%b = load <4 x i16>, ptr %b.gep		%b = load <4 x i16>, ptr %b.gep
%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
store <8 x i16> %c, ptr %z		store <8 x i16> %c, ptr %z
ret void		ret void
}		}

; Vector is too large to fit into a single strided load		; Vector is too large to fit into a single strided load
define void @strided_constant_v4i32(ptr %x, ptr %z) {		define void @strided_constant_v4i32(ptr %x, ptr %z) {
		reames: The comment and the code appear out of sync here. I think the code is correct.
		luke: Whoops, this is testing the wrong thing. The stride here should be > 16 to test the case where the combined element size > the max EEW.
; CHECK-LABEL: strided_constant_v4i32:		; CHECK-LABEL: strided_constant_v4i32:
; CHECK: # %bb.0:		; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 4, e32, m1, ta, ma		; CHECK-NEXT: vsetivli zero, 4, e32, m1, ta, ma
; CHECK-NEXT: vle32.v v8, (a0)		; CHECK-NEXT: vle32.v v8, (a0)
; CHECK-NEXT: addi a0, a0, 32		; CHECK-NEXT: addi a0, a0, 32
; CHECK-NEXT: vle32.v v10, (a0)		; CHECK-NEXT: vle32.v v10, (a0)
; CHECK-NEXT: vsetivli zero, 8, e32, m2, ta, ma		; CHECK-NEXT: vsetivli zero, 8, e32, m2, ta, ma
; CHECK-NEXT: vslideup.vi v8, v10, 4		; CHECK-NEXT: vslideup.vi v8, v10, 4
(57 lines not shown)
%e.2 = shufflevector <8 x i16> %e.0, <8 x i16> %e.1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>		%e.2 = shufflevector <8 x i16> %e.0, <8 x i16> %e.1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
store <16 x i16> %e.2, ptr %z		store <16 x i16> %e.2, ptr %z
ret void		ret void
}		}

define void @strided_runtime(ptr %x, ptr %z, i64 %s) {		define void @strided_runtime(ptr %x, ptr %z, i64 %s) {
; CHECK-LABEL: strided_runtime:		; CHECK-LABEL: strided_runtime:
; CHECK: # %bb.0:		; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma		; CHECK-NEXT: vsetivli zero, 2, e64, m1, ta, ma
; CHECK-NEXT: vle16.v v8, (a0)		; CHECK-NEXT: vlse64.v v8, (a0), a2
; CHECK-NEXT: add a0, a0, a2		; CHECK-NEXT: vse64.v v8, (a1)
; CHECK-NEXT: vle16.v v9, (a0)
; CHECK-NEXT: vsetivli zero, 8, e16, m1, ta, ma
; CHECK-NEXT: vslideup.vi v8, v9, 4
; CHECK-NEXT: vse16.v v8, (a1)
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%a = load <4 x i16>, ptr %x		%a = load <4 x i16>, ptr %x
%b.gep = getelementptr i8, ptr %x, i64 %s		%b.gep = getelementptr i8, ptr %x, i64 %s
%b = load <4 x i16>, ptr %b.gep		%b = load <4 x i16>, ptr %b.gep
%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
store <8 x i16> %c, ptr %z		store <8 x i16> %c, ptr %z
ret void		ret void
}		}

define void @strided_runtime_4xv4i16(ptr %x, ptr %z, i64 %s) {		define void @strided_runtime_4xv4i16(ptr %x, ptr %z, i64 %s) {
; CHECK-LABEL: strided_runtime_4xv4i16:		; CHECK-LABEL: strided_runtime_4xv4i16:
; CHECK: # %bb.0:		; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma		; CHECK-NEXT: vsetivli zero, 4, e64, m2, ta, ma
; CHECK-NEXT: vle16.v v8, (a0)		; CHECK-NEXT: vlse64.v v8, (a0), a2
; CHECK-NEXT: add a0, a0, a2		; CHECK-NEXT: vse64.v v8, (a1)
; CHECK-NEXT: vle16.v v10, (a0)
; CHECK-NEXT: add a0, a0, a2
; CHECK-NEXT: vle16.v v12, (a0)
; CHECK-NEXT: add a0, a0, a2
; CHECK-NEXT: vle16.v v14, (a0)
; CHECK-NEXT: vsetivli zero, 8, e16, m2, tu, ma
; CHECK-NEXT: vslideup.vi v8, v10, 4
; CHECK-NEXT: vsetivli zero, 12, e16, m2, tu, ma
; CHECK-NEXT: vslideup.vi v8, v12, 8
; CHECK-NEXT: vsetivli zero, 16, e16, m2, ta, ma
; CHECK-NEXT: vslideup.vi v8, v14, 12
; CHECK-NEXT: vse16.v v8, (a1)
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%a = load <4 x i16>, ptr %x		%a = load <4 x i16>, ptr %x
%b.gep = getelementptr i8, ptr %x, i64 %s		%b.gep = getelementptr i8, ptr %x, i64 %s
%b = load <4 x i16>, ptr %b.gep		%b = load <4 x i16>, ptr %b.gep
%c.gep = getelementptr i8, ptr %b.gep, i64 %s		%c.gep = getelementptr i8, ptr %b.gep, i64 %s
%c = load <4 x i16>, ptr %c.gep		%c = load <4 x i16>, ptr %c.gep
%d.gep = getelementptr i8, ptr %c.gep, i64 %s		%d.gep = getelementptr i8, ptr %c.gep, i64 %s
%d = load <4 x i16>, ptr %d.gep		%d = load <4 x i16>, ptr %d.gep
(55 lines not shown)
%e.2 = shufflevector <8 x i16> %e.0, <8 x i16> %e.1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>		%e.2 = shufflevector <8 x i16> %e.0, <8 x i16> %e.1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
store <16 x i16> %e.2, ptr %z		store <16 x i16> %e.2, ptr %z
ret void		ret void
}		}

define void @strided_runtime_4xv4f16(ptr %x, ptr %z, i64 %s) {		define void @strided_runtime_4xv4f16(ptr %x, ptr %z, i64 %s) {
; CHECK-LABEL: strided_runtime_4xv4f16:		; CHECK-LABEL: strided_runtime_4xv4f16:
; CHECK: # %bb.0:		; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma		; CHECK-NEXT: vsetivli zero, 4, e64, m2, ta, ma
; CHECK-NEXT: vle16.v v8, (a0)		; CHECK-NEXT: vlse64.v v8, (a0), a2
; CHECK-NEXT: add a0, a0, a2		; CHECK-NEXT: vse64.v v8, (a1)
; CHECK-NEXT: vle16.v v10, (a0)
; CHECK-NEXT: add a0, a0, a2
; CHECK-NEXT: vle16.v v12, (a0)
; CHECK-NEXT: add a0, a0, a2
; CHECK-NEXT: vle16.v v14, (a0)
; CHECK-NEXT: vsetivli zero, 8, e16, m2, tu, ma
; CHECK-NEXT: vslideup.vi v8, v10, 4
; CHECK-NEXT: vsetivli zero, 12, e16, m2, tu, ma
; CHECK-NEXT: vslideup.vi v8, v12, 8
; CHECK-NEXT: vsetivli zero, 16, e16, m2, ta, ma
; CHECK-NEXT: vslideup.vi v8, v14, 12
; CHECK-NEXT: vse16.v v8, (a1)
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%a = load <4 x half>, ptr %x		%a = load <4 x half>, ptr %x
%b.gep = getelementptr i8, ptr %x, i64 %s		%b.gep = getelementptr i8, ptr %x, i64 %s
%b = load <4 x half>, ptr %b.gep		%b = load <4 x half>, ptr %b.gep
%c.gep = getelementptr i8, ptr %b.gep, i64 %s		%c.gep = getelementptr i8, ptr %b.gep, i64 %s
%c = load <4 x half>, ptr %c.gep		%c = load <4 x half>, ptr %c.gep
%d.gep = getelementptr i8, ptr %c.gep, i64 %s		%d.gep = getelementptr i8, ptr %c.gep, i64 %s
%d = load <4 x half>, ptr %d.gep		%d = load <4 x half>, ptr %d.gep
%e.0 = shufflevector <4 x half> %a, <4 x half> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%e.0 = shufflevector <4 x half> %a, <4 x half> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%e.1 = shufflevector <4 x half> %c, <4 x half> %d, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%e.1 = shufflevector <4 x half> %c, <4 x half> %d, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%e.2 = shufflevector <8 x half> %e.0, <8 x half> %e.1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>		%e.2 = shufflevector <8 x half> %e.0, <8 x half> %e.1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
store <16 x half> %e.2, ptr %z		store <16 x half> %e.2, ptr %z
ret void		ret void
}		}

define void @strided_runtime_4xv2f32(ptr %x, ptr %z, i64 %s) {		define void @strided_runtime_4xv2f32(ptr %x, ptr %z, i64 %s) {
; CHECK-LABEL: strided_runtime_4xv2f32:		; CHECK-LABEL: strided_runtime_4xv2f32:
; CHECK: # %bb.0:		; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 2, e32, mf2, ta, ma		; CHECK-NEXT: vsetivli zero, 4, e64, m2, ta, ma
; CHECK-NEXT: vle32.v v8, (a0)		; CHECK-NEXT: vlse64.v v8, (a0), a2
; CHECK-NEXT: add a0, a0, a2		; CHECK-NEXT: vse64.v v8, (a1)
; CHECK-NEXT: vle32.v v10, (a0)
; CHECK-NEXT: add a0, a0, a2
; CHECK-NEXT: vle32.v v12, (a0)
; CHECK-NEXT: add a0, a0, a2
; CHECK-NEXT: vle32.v v14, (a0)
; CHECK-NEXT: vsetivli zero, 4, e32, m2, tu, ma
; CHECK-NEXT: vslideup.vi v8, v10, 2
; CHECK-NEXT: vsetivli zero, 6, e32, m2, tu, ma
; CHECK-NEXT: vslideup.vi v8, v12, 4
; CHECK-NEXT: vsetivli zero, 8, e32, m2, ta, ma
; CHECK-NEXT: vslideup.vi v8, v14, 6
; CHECK-NEXT: vse32.v v8, (a1)
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%a = load <2 x float>, ptr %x		%a = load <2 x float>, ptr %x
%b.gep = getelementptr i8, ptr %x, i64 %s		%b.gep = getelementptr i8, ptr %x, i64 %s
%b = load <2 x float>, ptr %b.gep		%b = load <2 x float>, ptr %b.gep
%c.gep = getelementptr i8, ptr %b.gep, i64 %s		%c.gep = getelementptr i8, ptr %b.gep, i64 %s
%c = load <2 x float>, ptr %c.gep		%c = load <2 x float>, ptr %c.gep
%d.gep = getelementptr i8, ptr %c.gep, i64 %s		%d.gep = getelementptr i8, ptr %c.gep, i64 %s
%d = load <2 x float>, ptr %d.gep		%d = load <2 x float>, ptr %d.gep
%e.0 = shufflevector <2 x float> %a, <2 x float> %b, <4 x i32> <i32 0, i32 1, i32 2, i32 3>		%e.0 = shufflevector <2 x float> %a, <2 x float> %b, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%e.1 = shufflevector <2 x float> %c, <2 x float> %d, <4 x i32> <i32 0, i32 1, i32 2, i32 3>		%e.1 = shufflevector <2 x float> %c, <2 x float> %d, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%e.2 = shufflevector <4 x float> %e.0, <4 x float> %e.1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%e.2 = shufflevector <4 x float> %e.0, <4 x float> %e.1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
store <8 x float> %e.2, ptr %z		store <8 x float> %e.2, ptr %z
ret void		ret void
}		}

; Shouldn't be combined because the resulting load would not be aligned		; Shouldn't be combined because the resulting load would not be aligned
define void @strided_unaligned(ptr %x, ptr %z, i64 %s) {		define void @strided_unaligned(ptr %x, ptr %z, i64 %s) {
; CHECK-LABEL: strided_unaligned:		; CHECK-LABEL: strided_unaligned:
; CHECK: # %bb.0:		; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 8, e16, m1, ta, ma		; CHECK-NEXT: vsetivli zero, 8, e16, m1, ta, ma
; CHECK-NEXT: vle8.v v8, (a0)		; CHECK-NEXT: vle8.v v8, (a0)
; CHECK-NEXT: add a0, a0, a2		; CHECK-NEXT: add a0, a0, a2
; CHECK-NEXT: vle8.v v9, (a0)		; CHECK-NEXT: vle8.v v9, (a0)
; CHECK-NEXT: vslideup.vi v8, v9, 4		; CHECK-NEXT: vslideup.vi v8, v9, 4
; CHECK-NEXT: vse16.v v8, (a1)		; CHECK-NEXT: vse16.v v8, (a1)
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%a = load <4 x i16>, ptr %x, align 1		%a = load <4 x i16>, ptr %x, align 1
%b.gep = getelementptr i8, ptr %x, i64 %s		%b.gep = getelementptr i8, ptr %x, i64 %s
%b = load <4 x i16>, ptr %b.gep, align 1		%b = load <4 x i16>, ptr %b.gep, align 1
		luke: @reames FYI, this mirrors the loads from SLP in x264 SAD, which have an alignment of 1.
		reames: For the purpose of this patch, please update the tests to have the required alignment and add a negative test for the unaligned case. I think this case is generally useful, regardless of the outcome on the spec test. For x264 specifically, let's talk offline.
%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
store <8 x i16> %c, ptr %z		store <8 x i16> %c, ptr %z
ret void		ret void
}		}
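
The alignment behaviour exercised by `strided_unaligned` and the tests that follow it can be summarised with a small C++ sketch. This is an assumption about the rule the tests imply, not the code in RISCVISelLowering.cpp: `alignmentAllowsWidening` is a hypothetical helper, and the `fastUnaligned` flag only reflects the open question raised in the review about targets with fast unaligned access.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical check: every element of the widened strided load starts at the
// address of one of the original loads, so the widened access is only known
// to be aligned to the smallest alignment among those loads.
bool alignmentAllowsWidening(const std::vector<uint64_t> &loadAligns,
                             uint64_t wideEltBytes,
                             bool fastUnaligned = false) {
  if (loadAligns.empty())
    return false;
  // Alignment that is guaranteed for all of the loads at once.
  uint64_t common = *std::min_element(loadAligns.begin(), loadAligns.end());
  // Assumption: a target with fast unaligned access could skip this check.
  return fastUnaligned || common >= wideEltBytes;
}
```

Under this reading, two `align 1` loads cannot become 8-byte elements, while loads with `align 8` and `align 16` share a guaranteed alignment of 8 and can.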

; Shouldn't be combined because the loads have different alignments		; Should use the most restrictive common alignment
define void @strided_mismatched_alignments(ptr %x, ptr %z, i64 %s) {		define void @strided_mismatched_alignments(ptr %x, ptr %z, i64 %s) {
; CHECK-LABEL: strided_mismatched_alignments:		; CHECK-LABEL: strided_mismatched_alignments:
; CHECK: # %bb.0:		; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma		; CHECK-NEXT: vsetivli zero, 2, e64, m1, ta, ma
; CHECK-NEXT: vle16.v v8, (a0)		; CHECK-NEXT: vlse64.v v8, (a0), a2
; CHECK-NEXT: add a0, a0, a2		; CHECK-NEXT: vse64.v v8, (a1)
; CHECK-NEXT: vle16.v v9, (a0)
; CHECK-NEXT: vsetivli zero, 8, e16, m1, ta, ma
; CHECK-NEXT: vslideup.vi v8, v9, 4
; CHECK-NEXT: vse16.v v8, (a1)
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%a = load <4 x i16>, ptr %x, align 8		%a = load <4 x i16>, ptr %x, align 8
%b.gep = getelementptr i8, ptr %x, i64 %s		%b.gep = getelementptr i8, ptr %x, i64 %s
%b = load <4 x i16>, ptr %b.gep, align 16		%b = load <4 x i16>, ptr %b.gep, align 16
%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
store <8 x i16> %c, ptr %z		store <8 x i16> %c, ptr %z
ret void		ret void
}		}

define void @strided_ok_alignments_8(ptr %x, ptr %z, i64 %s) {		define void @strided_ok_alignments_8(ptr %x, ptr %z, i64 %s) {
; CHECK-LABEL: strided_ok_alignments_8:		; CHECK-LABEL: strided_ok_alignments_8:
; CHECK: # %bb.0:		; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma		; CHECK-NEXT: vsetivli zero, 2, e64, m1, ta, ma
; CHECK-NEXT: vle16.v v8, (a0)		; CHECK-NEXT: vlse64.v v8, (a0), a2
; CHECK-NEXT: add a0, a0, a2		; CHECK-NEXT: vse64.v v8, (a1)
; CHECK-NEXT: vle16.v v9, (a0)
; CHECK-NEXT: vsetivli zero, 8, e16, m1, ta, ma
; CHECK-NEXT: vslideup.vi v8, v9, 4
; CHECK-NEXT: vse16.v v8, (a1)
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%a = load <4 x i16>, ptr %x, align 8		%a = load <4 x i16>, ptr %x, align 8
%b.gep = getelementptr i8, ptr %x, i64 %s		%b.gep = getelementptr i8, ptr %x, i64 %s
%b = load <4 x i16>, ptr %b.gep, align 8		%b = load <4 x i16>, ptr %b.gep, align 8
%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
store <8 x i16> %c, ptr %z		store <8 x i16> %c, ptr %z
ret void		ret void
}		}

define void @strided_ok_alignments_16(ptr %x, ptr %z, i64 %s) {		define void @strided_ok_alignments_16(ptr %x, ptr %z, i64 %s) {
; CHECK-LABEL: strided_ok_alignments_16:		; CHECK-LABEL: strided_ok_alignments_16:
; CHECK: # %bb.0:		; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma		; CHECK-NEXT: vsetivli zero, 2, e64, m1, ta, ma
; CHECK-NEXT: vle16.v v8, (a0)		; CHECK-NEXT: vlse64.v v8, (a0), a2
; CHECK-NEXT: add a0, a0, a2		; CHECK-NEXT: vse64.v v8, (a1)
; CHECK-NEXT: vle16.v v9, (a0)
; CHECK-NEXT: vsetivli zero, 8, e16, m1, ta, ma
; CHECK-NEXT: vslideup.vi v8, v9, 4
; CHECK-NEXT: vse16.v v8, (a1)
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%a = load <4 x i16>, ptr %x, align 16		%a = load <4 x i16>, ptr %x, align 16
%b.gep = getelementptr i8, ptr %x, i64 %s		%b.gep = getelementptr i8, ptr %x, i64 %s
%b = load <4 x i16>, ptr %b.gep, align 16		%b = load <4 x i16>, ptr %b.gep, align 16
%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
store <8 x i16> %c, ptr %z		store <8 x i16> %c, ptr %z
ret void		ret void
}		}

; Shouldn't be combined because one of the loads is not simple		; Shouldn't be combined because one of the loads is not simple
define void @strided_non_simple_load(ptr %x, ptr %z, i64 %s) {		define void @strided_non_simple_load(ptr %x, ptr %z, i64 %s) {
; CHECK-LABEL: strided_non_simple_load:		; CHECK-LABEL: strided_non_simple_load:
; CHECK: # %bb.0:		; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma		; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma
; CHECK-NEXT: vle16.v v8, (a0)		; CHECK-NEXT: vle16.v v8, (a0)
; CHECK-NEXT: add a0, a0, a2		; CHECK-NEXT: add a0, a0, a2
; CHECK-NEXT: vle16.v v9, (a0)		; CHECK-NEXT: vle16.v v9, (a0)
; CHECK-NEXT: vsetivli zero, 8, e16, m1, ta, ma		; CHECK-NEXT: vsetivli zero, 8, e16, m1, ta, ma
; CHECK-NEXT: vslideup.vi v8, v9, 4		; CHECK-NEXT: vslideup.vi v8, v9, 4
; CHECK-NEXT: vse16.v v8, (a1)		; CHECK-NEXT: vse16.v v8, (a1)
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%a = load <4 x i16>, ptr %x		%a = load <4 x i16>, ptr %x
%b.gep = getelementptr i8, ptr %x, i64 %s		%b.gep = getelementptr i8, ptr %x, i64 %s
%b = load volatile <4 x i16>, ptr %b.gep		%b = load volatile <4 x i16>, ptr %b.gep
%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
store <8 x i16> %c, ptr %z		store <8 x i16> %c, ptr %z
ret void		ret void
}		}

; Shouldn't be combined because one of the operands is not a load		; Shouldn't be combined because one of the operands is not a load
define void @strided_non_load(ptr %x, ptr %z, <4 x i16> %b) {		define void @strided_non_load(ptr %x, ptr %z, <4 x i16> %b) {
; CHECK-LABEL: strided_non_load:		; CHECK-LABEL: strided_non_load:
; CHECK: # %bb.0:		; CHECK: # %bb.0:
; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma		; CHECK-NEXT: vsetivli zero, 4, e16, mf2, ta, ma
; CHECK-NEXT: vle16.v v9, (a0)		; CHECK-NEXT: vle16.v v9, (a0)
; CHECK-NEXT: vsetivli zero, 8, e16, m1, ta, ma		; CHECK-NEXT: vsetivli zero, 8, e16, m1, ta, ma
; CHECK-NEXT: vslideup.vi v9, v8, 4		; CHECK-NEXT: vslideup.vi v9, v8, 4
; CHECK-NEXT: vse16.v v9, (a1)		; CHECK-NEXT: vse16.v v9, (a1)
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%a = load <4 x i16>, ptr %x		%a = load <4 x i16>, ptr %x
%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%c = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
store <8 x i16> %c, ptr %z		store <8 x i16> %c, ptr %z
ret void		ret void
}		}
		reames: Can you add a couple of negative tests?
		1) A case where the stride is not equal (i.e. we recognize stride mismatch).
		2) A case where the resulting type is not legal.
		3) A case with a non-simple load.
		4) A case where one of the operands is not a load.
		luke: I'm struggling to think of ways to get an illegal result type. We can produce an illegal type when we concat an irregular number of vectors, so we get something like `3 x v4i16 -> v12i16`, which is covered by `widen_3xv4i16`, but in that case we never actually do the combine in the first place since it doesn't have a concat_vector that we can match on. `strided_constant_v4i32` handles the case where we would have tried to do a strided load of `v2i128`. Did you have a specific example in mind?
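
The element-width limit discussed above (the `v2i128` case) can be sketched in C++ as follows. This is only an illustration under stated assumptions: `wideElementIsLegal` is a hypothetical helper, `maxEEWBits` stands for the largest element width the target supports (64 with `+v`), and the power-of-two requirement is an assumption rather than something stated in the review.

```cpp
#include <cassert>

// Hypothetical check: each narrow vector becomes one element of the widened
// strided load, so the narrow vector's total bit width must be a usable
// element width on the target.
bool wideElementIsLegal(unsigned narrowEltBits, unsigned narrowNumElts,
                        unsigned maxEEWBits) {
  unsigned wideBits = narrowEltBits * narrowNumElts;
  // Assumption: only power-of-two widths of at least 8 bits are candidates.
  bool powerOfTwo = wideBits != 0 && (wideBits & (wideBits - 1)) == 0;
  return powerOfTwo && wideBits >= 8 && wideBits <= maxEEWBits;
}
```

For example, `<4 x i16>` operands give 64-bit elements and pass with a 64-bit limit, while `<4 x i32>` operands would need 128-bit elements and are rejected, matching the uncombined output of `strided_constant_v4i32`.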

Revision Contents

Diff 513172
