This turns vector loads into scalar loads whenever they are only used by splat shufflevectors, since only one element of the loaded vector is ever accessed.
This opens up better instruction selection for splatted loads and should address a WebAssembly issue: https://github.com/llvm/llvm-project/issues/59120
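The change itself operates on SelectionDAG nodes in DAGCombiner, but as a rough IR-level sketch of the effect (function names and types here are made up for illustration), a splat of a loaded vector can be rewritten to load just the splatted element and splat the scalar:

```llvm
; Before: the whole vector is loaded, but the splat only reads element 1.
define <4 x i32> @splat_of_load(ptr %p) {
  %v = load <4 x i32>, ptr %p
  %s = shufflevector <4 x i32> %v, <4 x i32> poison, <4 x i32> <i32 1, i32 1, i32 1, i32 1>
  ret <4 x i32> %s
}

; After (conceptually): load only element 1 as a scalar and splat it.
define <4 x i32> @splat_of_load_scalarized(ptr %p) {
  %ep = getelementptr inbounds i32, ptr %p, i64 1
  %e = load i32, ptr %ep
  %ins = insertelement <4 x i32> poison, i32 %e, i64 0
  %s = shufflevector <4 x i32> %ins, <4 x i32> poison, <4 x i32> zeroinitializer
  ret <4 x i32> %s
}
```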
Event Timeline
Thanks! It looks like some other X86 tests are still failing; I'll try to address them as well.
On aarch64 some tests in arm64-dup.ll are failing:
```llvm
define <8 x i8> @vduplane8(<8 x i8>* %A) nounwind {
; CHECK-LABEL: vduplane8:
; CHECK:       // %bb.0:
; CHECK-NEXT:    ldr d0, [x0]
; CHECK-NEXT:    dup.8b v0, v0[1]
; CHECK-NEXT:    ret
  %tmp1 = load <8 x i8>, <8 x i8>* %A
  %tmp2 = shufflevector <8 x i8> %tmp1, <8 x i8> undef, <8 x i32> < i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1 >
  ret <8 x i8> %tmp2
}
```
Since they're now being selected as a broadcasted load:
```
vduplane8:                              // @vduplane8
// %bb.0:
	add	x8, x0, #1
	ld1r.8b	{ v0 }, [x8]
	ret
```
I'm not familiar with aarch64, but is it not possible to fold the offset in like this?
```
vduplane8:                              // @vduplane8
// %bb.0:
	ld1r.8b	{ v0 }, [x0], #1
	ret
```
llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll, lines 138–140:
I presume this is a regression, since even though it's loading smaller sizes, it has to do more twiddling.
llvm/test/CodeGen/PowerPC/canonical-merge-shuffles.ll, lines 1153–1166:
These extra lines replace the old P8-AIX prefixed checks that must have been left behind.
llvm/test/CodeGen/X86/half.ll, lines 1342–1345:
@pengfei This looks like a regression: the scalarized load t18 gets selected as VPINSRWrm.

```
t0: ch,glue = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %0
t17: i64 = add t2, Constant:i64<8>
t18: f16,ch = load<(load (s16) from %ir.p + 8, align 8)> t0, t17, undef:i64
t21: v8f16 = scalar_to_vector t18
t23: v8i16 = bitcast t21
t28: v8i16 = X86ISD::PSHUFLW t23, TargetConstant:i8<0>
t29: v4i32 = bitcast t28
t30: v4i32 = X86ISD::PSHUFD t29, TargetConstant:i8<0>
t36: v8f16 = bitcast t30
t10: ch,glue = CopyToReg t0, Register:v8f16 $xmm0, t36
t11: ch = X86ISD::RET_FLAG t10, TargetConstant:i32<0>, Register:v8f16 $xmm0, t10:1
```
llvm/test/CodeGen/AArch64/arm64-vmul.ll, lines 1102–1106 (on Diff #483185):
Regression: can the add be folded in as an immediate offset to ld1r.8h { v0 }, [x8]?
llvm/test/CodeGen/X86/half.ll, lines 1342–1345:
Right. I think this is a special case. We don't have native scalar instructions to load/store half or bfloat on old targets; instead we have to emulate them with the more expensive pinsrw/pextrw, which makes the scalar load/store operations suboptimal compared to the vector ones.
llvm/test/CodeGen/AArch64/arm64-vmul.ll, lines 1102–1106 (on Diff #483185):
No. In that form the immediate is a post-index, so the offset would actually increment the register operand rather than being added to the address.
llvm/test/CodeGen/AArch64/arm64-vmul.ll, lines 1102–1106 (on Diff #483185):
Hello. This can actually be worse, but I don't think that's an issue with this patch. We tend to prefer ld1r over the dup in the mul instruction, but I believe the opposite can be quicker. That is a general issue though; this patch is just exposing it in a few extra places.
None of these tests need to load data, and I think it would be better to remove that part. If they just use vector parameters directly (see the sketch below), they will be less susceptible to optimizations on the load altering what the test is intended to check. I can put a patch together to clean this up, and if you rebase on top most of these changes should hopefully disappear.
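As an illustration of the load-free test style being suggested (a minimal sketch with a made-up function name, not the actual cleanup patch), the test would take the vectors as parameters and splat one of them directly:

```llvm
; Hypothetical load-free test: the splat-and-multiply pattern is checked
; without a load that other optimizations could rewrite.
define <8 x i16> @mul_lane(<8 x i16> %a, <8 x i16> %b) {
  %lane = shufflevector <8 x i16> %b, <8 x i16> poison, <8 x i32> <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %r = mul <8 x i16> %a, %lane
  ret <8 x i16> %r
}
```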
llvm/test/CodeGen/AArch64/arm64-vmul.ll, lines 1102–1106 (on Diff #483185):
Hopefully, if you rebase over rG752819e813d1, most of these changes will go away.
llvm/test/CodeGen/AArch64/arm64-vmul.ll, lines 1102–1106 (on Diff #483185):
Hello, thanks a million for that patch. Will rebase.
llvm/test/CodeGen/AArch64/arm64-vmul.ll, lines 1227–1230 (on Diff #483185):
@dmgreen This looks like a regression to me, but I'm not familiar enough with AArch64 to really know for certain. I presume the cost of the additional add instruction outweighs any gains from a smaller load; is that correct? (Hope I'm not bombarding you with too many questions; let me know if there's someone else I can ask!)
Thanks. The remaining Arm and AArch64 test cases look OK to me.
llvm/test/CodeGen/AArch64/arm64-vmul.ll, lines 1227–1230 (on Diff #483185):
Yeah, it's the same as the other cases. I heard that it can be a little worse than if the dup could be part of the mul/fmulx/etc, but that's a separate issue from this patch. In practice many cases will already be a splat of a scalar value, so will already run into the same issue.
llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll, lines 1895–1904:
@foad @ruiling Apologies if I'm pinging the wrong people here; I just wanted to get some AMDGPU eyes over this.
llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll, lines 1895–1904:
Yes, it looks like a regression, but I'm not sure how serious it is. The original code did two 4-byte loads even though we only want the upper two bytes of each value. Now we've turned the second one into a 2-byte load that overwrites part of the result of the first load, hence the WAW dependency. Why can't we also turn the first load into a 2-byte load?
llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll, lines 1895–1904:
The waitcnt insertion pass probably doesn't try to understand the tied operands of the d16 loads.
llvm/test/CodeGen/AMDGPU/vector_shuffle.packed.ll, lines 1895–1904:
Right, at the MIR level it looks like a RAW dependency because the d16 load has a tied read representing the parts of the destination register that are not overwritten. So I guess we *could* fix this in the waitcnt insertion pass. (It sounds similar to the special case for writelane in AMDGPUInsertDelayAlu.)
llvm/test/CodeGen/X86/avx-vbroadcast.ll, line 367:
; Pointer adjusted broadcasts
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp, line 23548:
Done. Is there another optimisation that occurs on a splat with a single non-undef mask element?
llvm/test/CodeGen/PowerPC/canonical-merge-shuffles.ll, lines 1153–1166:
Sorry, you're right; not sure why I believed that. So if I'm understanding this correctly now, P8-AIX-NEXT checks are generated whenever aix-32 and aix-64 have the same lines.
Similar to D140811, we might want to limit this to cases where the shuffle is not just a single non-undef mask element.
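For context, a minimal IR sketch (function names are made up) of the two shuffle shapes being distinguished: a full splat, where scalarizing the load clearly pays off, versus a "splat" whose mask has only one non-undef element, which behaves more like an extract of a single lane and may be better left to other combines:

```llvm
; Full splat: every lane reads element 1 of the loaded vector.
define <4 x i32> @full_splat(ptr %p) {
  %v = load <4 x i32>, ptr %p
  %s = shufflevector <4 x i32> %v, <4 x i32> poison, <4 x i32> <i32 1, i32 1, i32 1, i32 1>
  ret <4 x i32> %s
}

; Single non-undef mask element: still classified as a splat, but only lane 0
; is defined, so it is effectively an extract rather than a broadcast.
define <4 x i32> @single_defined_lane(ptr %p) {
  %v = load <4 x i32>, ptr %p
  %s = shufflevector <4 x i32> %v, <4 x i32> poison, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  ret <4 x i32> %s
}
```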