This is an archive of the discontinued LLVM Phabricator instance.

Differential D137931

[RISCV] Don't use zero-stride vector load for gather if not optimized
ClosedPublic

Authored by pcwang-thead on Nov 14 2022, 3:29 AM.

Details

Reviewers

reames
craig.topper

Commits

rGa214c521f876: [RISCV] Don't use zero-stride vector load for gather if not optimized

Summary

We may form a zero-stride vector load when lowering a gather to a strided
load. As in D137699, we use a scalar load plus a splat for this form if
the subtarget has no optimized zero-stride load implementation.
We currently restrict this to unmasked loads because of the complexity
of handling all-false masks.
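
The equivalence the summary relies on can be sketched in plain C++. This is only a model of the semantics, not LLVM code: `stridedLoad` and `scalarLoadSplat` are hypothetical helper names, lanes are assumed to be one byte wide, and the stride is given in bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Model of an unmasked strided vector load: lane i reads the byte at
// base + i * strideBytes.
std::vector<uint8_t> stridedLoad(const uint8_t *base, std::ptrdiff_t strideBytes,
                                 std::size_t vl) {
  std::vector<uint8_t> result(vl);
  for (std::size_t i = 0; i < vl; ++i)
    result[i] = base[static_cast<std::ptrdiff_t>(i) * strideBytes];
  return result;
}

// Model of the replacement sequence: one scalar load, then a splat of that
// scalar across all vl lanes.
std::vector<uint8_t> scalarLoadSplat(const uint8_t *base, std::size_t vl) {
  uint8_t scalar = *base;
  return std::vector<uint8_t>(vl, scalar);
}
```

With a stride of zero every lane reads the same address, so both functions return the same vector; with any other stride they generally differ, which is why the transformation is limited to the zero-stride case.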

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

• pcwang-thead created this revision. Nov 14 2022, 3:29 AM

Herald added a project: Restricted Project. Nov 14 2022, 3:29 AM

Herald added subscribers: sunshaoce, VincentWu, StephenFan and 29 others.

• pcwang-thead requested review of this revision. Nov 14 2022, 3:29 AM

Herald added subscribers: llvm-commits, eopXD, MaskRay.

Harbormaster completed remote builds in B197491: Diff 475095. Nov 14 2022, 4:11 AM

This isn't correct. The strided load can be masked. For the case where all lanes are masked off, executing the scalar load is unsound and could introduce a fault.

You could allow any mask where you can prove at least one lane active, or make the scalar store conditional, but there's a bunch of complexity there. As a starting point, I suggest you restrict your transformation to when the instruction is unmasked.
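
The soundness problem can be illustrated with a small C++ model. It is a sketch under assumptions, not the patch's code: `CountingMemory`, `maskedZeroStrideLoad`, `naiveScalarLoadSplat`, and `guardedScalarLoadSplat` are hypothetical names, an out-of-range read throwing `std::out_of_range` stands in for a hardware fault, and `mask` and `passthru` are assumed to have the same length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Toy memory that counts reads and throws on an out-of-range address
// (standing in for a fault).
struct CountingMemory {
  std::vector<uint8_t> bytes;
  mutable std::size_t reads = 0;
  explicit CountingMemory(std::vector<uint8_t> b) : bytes(std::move(b)) {}
  uint8_t read(std::size_t addr) const {
    ++reads;
    return bytes.at(addr);
  }
};

// Masked zero-stride load: inactive lanes keep passthru and never touch memory.
std::vector<uint8_t> maskedZeroStrideLoad(const CountingMemory &mem,
                                          std::size_t addr,
                                          const std::vector<bool> &mask,
                                          const std::vector<uint8_t> &passthru) {
  std::vector<uint8_t> result(passthru);
  for (std::size_t i = 0; i < mask.size(); ++i)
    if (mask[i])
      result[i] = mem.read(addr);
  return result;
}

// Naive rewrite: the scalar load runs unconditionally, even if no lane is active.
std::vector<uint8_t> naiveScalarLoadSplat(const CountingMemory &mem,
                                          std::size_t addr,
                                          const std::vector<bool> &mask,
                                          const std::vector<uint8_t> &passthru) {
  uint8_t scalar = mem.read(addr);
  std::vector<uint8_t> result(passthru);
  for (std::size_t i = 0; i < mask.size(); ++i)
    if (mask[i])
      result[i] = scalar;
  return result;
}

bool anyLaneActive(const std::vector<bool> &mask) {
  for (bool lane : mask)
    if (lane)
      return true;
  return false;
}

// One reading of the suggested alternative: only load when a lane is active.
std::vector<uint8_t> guardedScalarLoadSplat(const CountingMemory &mem,
                                            std::size_t addr,
                                            const std::vector<bool> &mask,
                                            const std::vector<uint8_t> &passthru) {
  if (!anyLaneActive(mask))
    return passthru;
  return naiveScalarLoadSplat(mem, addr, mask, passthru);
}
```

With an all-false mask and an invalid address, the masked load returns `passthru` without reading, while the naive rewrite faults. The guarded variant avoids the fault but needs an extra runtime check, which is the complexity the comment refers to.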

This revision now requires changes to proceed. Nov 14 2022, 12:28 PM

Restrict this to unmasked loads only.

• pcwang-thead edited the summary of this revision. Nov 14 2022, 11:58 PM

In D137931#3925757, @reames wrote:

This isn't correct. The strided load can be masked. For the case where all lanes are masked off, executing the scalar load is unsound and could introduce a fault.

You could allow any mask where you can prove at least one lane active, or make the scalar store conditional, but there's a bunch of complexity there. As a starting point, I suggest you restrict your transformation to when the instruction is unmasked.

Thanks. I didn't consider these situations before.

I tried transforming masked loads into scalar load + splat + vrgather, but it may not be worth doing, since three instructions are needed. And for a mask that turns out to be all zeros only at runtime, handling it would add some cost. So I think we should handle only unmasked loads for now. :-)

Harbormaster completed remote builds in B197682: Diff 475353. Nov 15 2022, 12:47 AM

LGTM

llvm/test/CodeGen/RISCV/rvv/fixed-vector-strided-load-store-asm.ll
Line 230 (reames): Hm, this test case is interesting. This is a case where even with a fast broadcast load, putting the value into a scalar allows the splat to be folded into the using instruction. This trades a scalar register for a vector one, and might be generally interesting. Maybe a case to give some further thought, definitely not blocking for this patch.
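
The folding described in this inline comment can be modeled in a few lines of C++. The helper names `splat`, `vaddVV`, and `vaddVX` are hypothetical and only mimic the element-wise behaviour of a vector-vector add and a vector-scalar add on 8-bit lanes with wrap-around arithmetic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Broadcast a scalar into every lane of a vl-element vector.
std::vector<uint8_t> splat(uint8_t scalar, std::size_t vl) {
  return std::vector<uint8_t>(vl, scalar);
}

// Vector-vector add (the .vv form): lane-wise sum, wrapping modulo 256.
std::vector<uint8_t> vaddVV(const std::vector<uint8_t> &a,
                            const std::vector<uint8_t> &b) {
  std::vector<uint8_t> result(a.size());
  for (std::size_t i = 0; i < a.size(); ++i)
    result[i] = static_cast<uint8_t>(a[i] + b[i]);
  return result;
}

// Vector-scalar add (the .vx form): the scalar operand stays in a scalar
// register, so no vector register is needed to hold the splat.
std::vector<uint8_t> vaddVX(const std::vector<uint8_t> &a, uint8_t x) {
  std::vector<uint8_t> result(a.size());
  for (std::size_t i = 0; i < a.size(); ++i)
    result[i] = static_cast<uint8_t>(a[i] + x);
  return result;
}
```

In this model `vaddVV(a, splat(x, a.size()))` and `vaddVX(a, x)` always agree, which is why keeping the loaded value in a scalar register lets the splat disappear into the `.vx` instruction.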

This revision is now accepted and ready to land. Nov 15 2022, 7:27 AM

Closed by commit rGa214c521f876: [RISCV] Don't use zero-stride vector load for gather if not optimized (authored by pcwang-thead). Nov 15 2022, 6:44 PM

This revision was automatically updated to reflect the committed changes.

• pcwang-thead added a commit: rGa214c521f876: [RISCV] Don't use zero-stride vector load for gather if not optimized.

• pcwang-thead marked an inline comment as done. Nov 16 2022, 12:35 AM

• pcwang-thead added inline comments.

llvm/test/CodeGen/RISCV/rvv/fixed-vector-strided-load-store-asm.ll
Line 230 (pcwang-thead): After some thought, I think there are opportunities to fold the splat into .vx instructions; see D138101.

Revision Contents

Path	Size
llvm/lib/Target/RISCV/RISCVISelLowering.cpp	63 lines
llvm/test/CodeGen/RISCV/rvv/fixed-vector-strided-load-store-asm.ll	76 lines

Diff 475657

llvm/lib/Target/RISCV/RISCVISelLowering.cpp

@@ case Intrinsic::riscv_masked_strided_load: {
     if (!IsUnmasked) {
       MVT MaskVT = getMaskTypeFor(ContainerVT);
       if (VT.isFixedLengthVector()) {
         Mask = convertToScalableVector(MaskVT, Mask, DAG, Subtarget);
         PassThru = convertToScalableVector(ContainerVT, PassThru, DAG, Subtarget);
       }
     }

+    auto *Load = cast<MemIntrinsicSDNode>(Op);
     SDValue VL = getDefaultVLOps(VT, ContainerVT, DL, DAG, Subtarget).second;
+    SDValue Ptr = Op.getOperand(3);
+    SDValue Stride = Op.getOperand(4);
+    SDValue Result, Chain;

-    SDValue IntID = DAG.getTargetConstant(
-        IsUnmasked ? Intrinsic::riscv_vlse : Intrinsic::riscv_vlse_mask, DL,
-        XLenVT);
+    // TODO: We restrict this to unmasked loads currently in consideration of
+    // the complexity of handling all-false masks.
+    if (IsUnmasked && isNullConstant(Stride) &&
+        !Subtarget.hasOptimizedZeroStrideLoad()) {
+      MVT ScalarVT = ContainerVT.getVectorElementType();
+      SDValue ScalarLoad =
+          DAG.getExtLoad(ISD::ZEXTLOAD, DL, XLenVT, Load->getChain(), Ptr,
+                         ScalarVT, Load->getMemOperand());
+      Chain = ScalarLoad.getValue(1);
+      Result = lowerScalarSplat(SDValue(), ScalarLoad, VL, ContainerVT, DL, DAG,
+                                Subtarget);
+    } else {
+      SDValue IntID = DAG.getTargetConstant(
+          IsUnmasked ? Intrinsic::riscv_vlse : Intrinsic::riscv_vlse_mask, DL,
+          XLenVT);

-    auto *Load = cast<MemIntrinsicSDNode>(Op);
-    SmallVector<SDValue, 8> Ops{Load->getChain(), IntID};
-    if (IsUnmasked)
-      Ops.push_back(DAG.getUNDEF(ContainerVT));
-    else
-      Ops.push_back(PassThru);
-    Ops.push_back(Op.getOperand(3)); // Ptr
-    Ops.push_back(Op.getOperand(4)); // Stride
-    if (!IsUnmasked)
-      Ops.push_back(Mask);
-    Ops.push_back(VL);
-    if (!IsUnmasked) {
-      SDValue Policy = DAG.getTargetConstant(RISCVII::TAIL_AGNOSTIC, DL, XLenVT);
-      Ops.push_back(Policy);
-    }
+      SmallVector<SDValue, 8> Ops{Load->getChain(), IntID};
+      if (IsUnmasked)
+        Ops.push_back(DAG.getUNDEF(ContainerVT));
+      else
+        Ops.push_back(PassThru);
+      Ops.push_back(Ptr);
+      Ops.push_back(Stride);
+      if (!IsUnmasked)
+        Ops.push_back(Mask);
+      Ops.push_back(VL);
+      if (!IsUnmasked) {
+        SDValue Policy =
+            DAG.getTargetConstant(RISCVII::TAIL_AGNOSTIC, DL, XLenVT);
+        Ops.push_back(Policy);
+      }

-    SDVTList VTs = DAG.getVTList({ContainerVT, MVT::Other});
-    SDValue Result =
-        DAG.getMemIntrinsicNode(ISD::INTRINSIC_W_CHAIN, DL, VTs, Ops,
-                                Load->getMemoryVT(), Load->getMemOperand());
-    SDValue Chain = Result.getValue(1);
+      SDVTList VTs = DAG.getVTList({ContainerVT, MVT::Other});
+      Result =
+          DAG.getMemIntrinsicNode(ISD::INTRINSIC_W_CHAIN, DL, VTs, Ops,
+                                  Load->getMemoryVT(), Load->getMemOperand());
+      Chain = Result.getValue(1);
+    }
     if (VT.isFixedLengthVector())
       Result = convertFromScalableVector(VT, Result, DAG, Subtarget);
     return DAG.getMergeValues({Result, Chain}, DL);
   }
   case Intrinsic::riscv_seg2_load:
   case Intrinsic::riscv_seg3_load:
   case Intrinsic::riscv_seg4_load:
   case Intrinsic::riscv_seg5_load:
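
The guard that selects the new path, and the effect of the zero-extending load followed by a splat, can be modeled outside of LLVM. The sketch below is illustrative only: `StrideInfo`, `useScalarLoadAndSplat`, `zextLoad8`, and `splatToI8` are hypothetical names, the register width is assumed to be 64 bits, and the element type is assumed to be i8 as in the `gather_zero_stride` test.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the stride operand: a known constant or not.
struct StrideInfo {
  bool isConstant;
  int64_t value;
};

// Mirrors the condition in the diff: the load is unmasked, the stride is a
// constant zero, and the subtarget lacks an optimized zero-stride load.
bool useScalarLoadAndSplat(bool isUnmasked, StrideInfo stride,
                           bool hasOptimizedZeroStrideLoad) {
  bool strideIsZero = stride.isConstant && stride.value == 0;
  return isUnmasked && strideIsZero && !hasOptimizedZeroStrideLoad;
}

// Model of a zero-extending byte load into a 64-bit scalar register.
uint64_t zextLoad8(const uint8_t *ptr) { return static_cast<uint64_t>(*ptr); }

// Model of splatting a register-width scalar into i8 lanes: only the low
// 8 bits of the scalar end up in each lane.
std::vector<uint8_t> splatToI8(uint64_t scalar, std::size_t vl) {
  return std::vector<uint8_t>(vl, static_cast<uint8_t>(scalar));
}
```

Because the splat keeps only the low element-width bits, the upper bits produced by the extension do not affect the resulting vector in this model.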

llvm/test/CodeGen/RISCV/rvv/fixed-vector-strided-load-store-asm.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc < %s -mtriple=riscv64 -mattr=+m,+v -riscv-v-vector-bits-min=256 | FileCheck %s --check-prefixes=CHECK,V
 ; RUN: llc < %s -mtriple=riscv64 -mattr=+m,+f,+zve32f -riscv-v-vector-bits-min=256 | FileCheck %s --check-prefixes=CHECK,ZVE32F
+; RUN: llc < %s -mtriple=riscv64 -mattr=+m,+v,+no-optimized-zero-stride-load -riscv-v-vector-bits-min=256 | FileCheck %s --check-prefixes=CHECK,NOT-OPTIMIZED
+; RUN: llc < %s -mtriple=riscv64 -mattr=+m,+f,+zve32f,+no-optimized-zero-stride-load -riscv-v-vector-bits-min=256 | FileCheck %s --check-prefixes=CHECK,NOT-OPTIMIZED

 %struct.foo = type { i32, i32, i32, i32 }

 ; void gather(signed char * __restrict A, signed char * __restrict B) {
 ;   for (int i = 0; i != 1024; ++i)
 ;     A[i] += B[i * 5];
 ; }
 define void @gather(i8* noalias nocapture %A, i8* noalias nocapture readonly %B) {
@@ vector.body: ; preds = %vector.body, %entry
   br i1 %6, label %for.cond.cleanup, label %vector.body

 for.cond.cleanup: ; preds = %vector.body
   ret void
 }

 define void @gather_zero_stride(i8* noalias nocapture %A, i8* noalias nocapture readonly %B) {
 ;
-; CHECK-LABEL: gather_zero_stride:
-; CHECK: # %bb.0: # %entry
-; CHECK-NEXT: li a2, 0
-; CHECK-NEXT: li a3, 32
-; CHECK-NEXT: li a4, 1024
-; CHECK-NEXT: .LBB3_1: # %vector.body
-; CHECK-NEXT: # =>This Inner Loop Header: Depth=1
-; CHECK-NEXT: vsetvli zero, a3, e8, m1, ta, ma
-; CHECK-NEXT: vlse8.v v8, (a1), zero
-; CHECK-NEXT: add a5, a0, a2
-; CHECK-NEXT: vle8.v v9, (a5)
-; CHECK-NEXT: vadd.vv v8, v9, v8
-; CHECK-NEXT: vse8.v v8, (a5)
-; CHECK-NEXT: addi a2, a2, 32
-; CHECK-NEXT: addi a1, a1, 160
-; CHECK-NEXT: bne a2, a4, .LBB3_1
-; CHECK-NEXT: # %bb.2: # %for.cond.cleanup
-; CHECK-NEXT: ret
+; V-LABEL: gather_zero_stride:
+; V: # %bb.0: # %entry
+; V-NEXT: li a2, 0
+; V-NEXT: li a3, 32
+; V-NEXT: li a4, 1024
+; V-NEXT: .LBB3_1: # %vector.body
+; V-NEXT: # =>This Inner Loop Header: Depth=1
+; V-NEXT: vsetvli zero, a3, e8, m1, ta, ma
+; V-NEXT: vlse8.v v8, (a1), zero
+; V-NEXT: add a5, a0, a2
+; V-NEXT: vle8.v v9, (a5)
+; V-NEXT: vadd.vv v8, v9, v8
+; V-NEXT: vse8.v v8, (a5)
+; V-NEXT: addi a2, a2, 32
+; V-NEXT: addi a1, a1, 160
+; V-NEXT: bne a2, a4, .LBB3_1
+; V-NEXT: # %bb.2: # %for.cond.cleanup
+; V-NEXT: ret
+;
+; ZVE32F-LABEL: gather_zero_stride:
+; ZVE32F: # %bb.0: # %entry
+; ZVE32F-NEXT: li a2, 0
+; ZVE32F-NEXT: li a3, 32
+; ZVE32F-NEXT: li a4, 1024
+; ZVE32F-NEXT: .LBB3_1: # %vector.body
+; ZVE32F-NEXT: # =>This Inner Loop Header: Depth=1
+; ZVE32F-NEXT: vsetvli zero, a3, e8, m1, ta, ma
+; ZVE32F-NEXT: vlse8.v v8, (a1), zero
+; ZVE32F-NEXT: add a5, a0, a2
+; ZVE32F-NEXT: vle8.v v9, (a5)
+; ZVE32F-NEXT: vadd.vv v8, v9, v8
+; ZVE32F-NEXT: vse8.v v8, (a5)
+; ZVE32F-NEXT: addi a2, a2, 32
+; ZVE32F-NEXT: addi a1, a1, 160
+; ZVE32F-NEXT: bne a2, a4, .LBB3_1
+; ZVE32F-NEXT: # %bb.2: # %for.cond.cleanup
+; ZVE32F-NEXT: ret
+;
+; NOT-OPTIMIZED-LABEL: gather_zero_stride:
+; NOT-OPTIMIZED: # %bb.0: # %entry
+; NOT-OPTIMIZED-NEXT: li a2, 0
+; NOT-OPTIMIZED-NEXT: li a3, 32
+; NOT-OPTIMIZED-NEXT: li a4, 1024
+; NOT-OPTIMIZED-NEXT: .LBB3_1: # %vector.body
+; NOT-OPTIMIZED-NEXT: # =>This Inner Loop Header: Depth=1
+; NOT-OPTIMIZED-NEXT: lbu a5, 0(a1)
+; NOT-OPTIMIZED-NEXT: add a6, a0, a2
+; NOT-OPTIMIZED-NEXT: vsetvli zero, a3, e8, m1, ta, ma
+; NOT-OPTIMIZED-NEXT: vle8.v v8, (a6)
+; NOT-OPTIMIZED-NEXT: vadd.vx v8, v8, a5
+; NOT-OPTIMIZED-NEXT: vse8.v v8, (a6)
+; NOT-OPTIMIZED-NEXT: addi a2, a2, 32
+; NOT-OPTIMIZED-NEXT: addi a1, a1, 160
+; NOT-OPTIMIZED-NEXT: bne a2, a4, .LBB3_1
+; NOT-OPTIMIZED-NEXT: # %bb.2: # %for.cond.cleanup
+; NOT-OPTIMIZED-NEXT: ret
 entry:
   br label %vector.body

 vector.body: ; preds = %vector.body, %entry
   %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
   %vec.ind = phi <32 x i64> [ zeroinitializer, %entry ], [ %vec.ind.next, %vector.body ]
   %0 = mul nuw nsw <32 x i64> %vec.ind, <i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5, i64 5>
   %1 = getelementptr inbounds i8, i8* %B, <32 x i64> %0