This is a follow-up to address a review comment on D124869. When deciding whether to PRE a vsetvli, we can allow non-LMUL1 vsetvlis.
There is an option to scale the bitwidth returned to the vectorizer by TTI's getRegisterBitWidth. Using that, you can get LMUL>1 fixed-vector loops. https://godbolt.org/z/34asbPcv7 should work for scalar too.
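For example (an illustrative sketch, not taken from the patch; it assumes -riscv-v-vector-bits-min=128, so VLEN >= 128 is known), a <16 x i32> operation is 512 bits wide and gets selected at LMUL=4:

; 16 x 32 bits = 512 bits, i.e. four times the 128-bit minimum VLEN, so
; this add should come out as a vadd.vv under e32/m4.
define <16 x i32> @lmul4_add(<16 x i32> %a, <16 x i32> %b) {
  %c = add <16 x i32> %a, %b
  ret <16 x i32> %c
}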
I had added some LMUL fixed-length tests to test/CodeGen/RISCV/rvv/sink-splat-operands.ll. (Odd name, but it's where all of our non-LMUL variants were, so...)
We seem to end up with odd patterns around loads and stores in the loop, where we toggle back and forth between e8 and e32. This toggling means we can't currently PRE the vsetvli.
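For reference, the problematic shape is roughly this (a hand-reduced sketch, not one of the actual tests; the widening and narrowing forces the element width to change inside the loop):

define void @toggle_sketch(i8* %p, <8 x i32> %v) {
entry:
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %entry
  %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
  %0 = getelementptr inbounds i8, i8* %p, i64 %index
  %1 = bitcast i8* %0 to <8 x i8>*
  %wide.load = load <8 x i8>, <8 x i8>* %1, align 1   ; selected under e8
  %2 = zext <8 x i8> %wide.load to <8 x i32>          ; widen to e32
  %3 = add <8 x i32> %2, %v                           ; arithmetic at e32
  %4 = trunc <8 x i32> %3 to <8 x i8>                 ; narrow back down
  store <8 x i8> %4, <8 x i8>* %1, align 1            ; store at e8 again
  %index.next = add nuw i64 %index, 8
  %5 = icmp eq i64 %index.next, 1024
  br i1 %5, label %exit, label %vector.body

exit:                                             ; preds = %vector.body
  ret void
}

The resulting vsetvlis alternate between e8 and e32 within the loop body, so no single VTYPE dominates the loop and there is nothing for PRE to hoist.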
llvm/test/CodeGen/RISCV/rvv/sink-splat-operands.ll:4181
I guess this didn't get optimized because the amount was in a register?
llvm/test/CodeGen/RISCV/rvv/sink-splat-operands.ll:4181
Yep, large constant AVLs are probably a case we need to handle explicitly. Haven't quite fully thought through what we want there.
Seems like there are no fractional LMULs tested by this patch? Does this suggest we should add some more test coverage?
Well, I would, but I could not find an in-tree example of what a fractional LMUL looks like in IR. (Probably just because I don't know what the syntax looks like.) If you give me an example, I can take it from there.
It's certainly easier with scalable vectors, but we do codegen fractional LMULs for fixed vectors if the minimum VLEN is sufficiently large that we know the vector can be contained within a fraction of a whole register. For example, this (copied) test case uses mf2 with -riscv-v-vector-bits-min=256:
define void @sink_splat_mul_lmulmf2(i32* nocapture %a, i32 signext %x) {
entry:
  %broadcast.splatinsert = insertelement <4 x i32> poison, i32 %x, i64 0
  %broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> poison, <4 x i32> zeroinitializer
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %entry
  %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
  %0 = getelementptr inbounds i32, i32* %a, i64 %index
  %1 = bitcast i32* %0 to <4 x i32>*
  %wide.load = load <4 x i32>, <4 x i32>* %1, align 8
  %2 = mul <4 x i32> %wide.load, %broadcast.splat
  %3 = bitcast i32* %0 to <4 x i32>*
  store <4 x i32> %2, <4 x i32>* %3, align 8
  %index.next = add nuw i64 %index, 4
  %4 = icmp eq i64 %index.next, 1024
  br i1 %4, label %for.cond.cleanup, label %vector.body

for.cond.cleanup:                                 ; preds = %vector.body
  ret void
}
Added coverage in 33b1be591.
For my context, why is it profitable to use fractional LMULs over LMUL=1? I'm aware of the extend/truncate cases, but for operations like VADD, it seems like mf2 and m1 are equivalent (assuming VL is the same), right?
The only case I can think of that might be profitable would be using a fractional lmul so that VLMax (and thus the x0 encoding) is equal to the AVL. That seems somewhat questionable on its own.
Using a mix of lmuls makes removing vsetvlis trickier. If we simply canonicalized fractional lmuls to lmul=1 (using knowledge about the vector length if needed for the vlmax case), it seems we'd potentially remove vsetvlis.
At least toggling back and forth between fractional and lmul=1 doesn't change VL for the subset of AVLs less than the fractional width. This does at least mean we can use the AVL preserving variant. (Though, I'm not sure we actually do this... a quick look seems to indicate we don't.)
In general, I'm struggling to understand why we'd want to use fractional lmuls. Any ideas?
You're correct as far as the hardware behavior goes.
The fixed vector to scalable vector mapping has been designed so that vectors of ELEN-sized (64- or 32-bit) elements with total width <= riscv-v-vector-bits-min will produce an LMUL=1 scalable vector. Wider vectors will use LMUL=2,4,8. Fractional LMUL is not supported for vectors with SEW==ELEN by the spec. Vectors with the same number of elements and a smaller SEW will use a proportionally smaller LMUL. This mapping is independent of what types are actually used in the basic block or function.
So in mixed element width code with all vectors having the same number of elements, all the vsetvlis should have the same SEW/LMUL ratio. That seems like the ideal property to have for vsetvli removal.
On an ELEN=64 target, with no i64/f64 elements and all vectors <= riscv-v-vector-bits-min, we will only have fractional LMULs.
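A concrete (made-up) example of that mapping, assuming an ELEN=64 target with -riscv-v-vector-bits-min=128:

define void @same_ratio(<2 x i64>* %p, <2 x i32>* %q) {
  ; <2 x i64> is 128 bits with SEW == ELEN, so it maps to an LMUL=1
  ; container and is selected under e64/m1.
  %a = load <2 x i64>, <2 x i64>* %p
  ; <2 x i32> has the same element count at half the width, so it maps to
  ; a proportionally smaller container and is selected under e32/mf2.
  %b = load <2 x i32>, <2 x i32>* %q
  store <2 x i64> %a, <2 x i64>* %p
  store <2 x i32> %b, <2 x i32>* %q
  ret void
}

Both operations have SEW/LMUL = 64, so for the same AVL they produce the same VL.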
Ok, I think I get your point here. In the world where all of the fixed-length vectors have the same number of elements (though possibly different element types), VL does not change when we switch element types. As such, we can use the VL-preserving form, as I noted above.
I do see some cases where, by changing from a fractional LMUL to LMUL1, we might be able to remove a vsetvli entirely, but all of those require speculation safety proofs (to avoid having to change VL).
Basically, we're making the (entirely reasonable for real hardware) guess that a VL-preserving vsetvli is cheaper to execute than a vsetvli which preserves VTYPE and changes VL. The default scheme biases towards creating cases where we only change VTYPE. The alternate canonicalization (towards a single LMUL) runs the risk of requiring a VL-changing vsetvli.
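As a sketch of that trade-off (the vsetvli forms in the comments are my expectation, not verified output; assumes -riscv-v-vector-bits-min=128):

define void @toggle_forms(<4 x i32>* %p, <4 x i16>* %q) {
  ; <4 x i32> maps to e32/m1 and <4 x i16> to e16/mf2, i.e. the same
  ; SEW/LMUL ratio, so the toggle between them can use the VL-preserving
  ; form: vsetvli x0, x0, e16, mf2, ta, mu
  %a = load <4 x i32>, <4 x i32>* %p
  %b = load <4 x i16>, <4 x i16>* %q
  ; Canonicalizing the e16 ops to m1 instead would change the SEW/LMUL
  ; ratio; the x0, x0 form would no longer be legal (VLMAX would change),
  ; and we would need a vsetvli that re-reads an AVL and rewrites VL.
  store <4 x i32> %a, <4 x i32>* %p
  store <4 x i16> %b, <4 x i16>* %q
  ret void
}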
Thanks for the context.
I'll give the speculation cases some more thought, but that's low priority at the moment. I don't have any concrete motivating examples.