This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Use V_PERM to match buildvectors when inputs are not canonicalized (i.e. can't use V_PACK)
ClosedPublic

Authored by jrbyrnes on Sep 22 2022, 11:26 AM.

Details

Summary

If we cannot prove that the f16 operands of a buildvector are canonicalized, then we cannot lower into a V_PACK. In this scenario, we would previously lower into some combination of and (SDWA), shr, and or. This patch allows matching into V_PERM instead, which uses an additional SGPR (or encodes the literal in the instruction itself) but has lower VALU latency.
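To make the trade-off concrete, here is a minimal Python sketch of the byte-select semantics involved (not the patch itself; it assumes the usual convention that selector byte values 0-3 index SRC1 and 4-7 index SRC0, and the register values and selector shown are only illustrative):

```
def v_perm_b32(src0: int, src1: int, sel: int) -> int:
    # Each byte of sel picks one byte from the pool {src1, src0};
    # selector values 8-15 (constant bytes) are omitted in this sketch.
    pool = src1.to_bytes(4, "little") + src0.to_bytes(4, "little")
    out = bytes(pool[(sel >> (8 * i)) & 0xFF] for i in range(4))
    return int.from_bytes(out, "little")

def pack_via_mask_shift_or(lo: int, hi: int) -> int:
    # Shape of the previous multi-instruction lowering: mask/shift/or
    # the two 16-bit values into one 32-bit register.
    return (lo & 0xFFFF) | ((hi & 0xFFFF) << 16)

v0, v1 = 0x3C001234, 0x4000ABCD  # arbitrary 32-bit register contents
# A single V_PERM with a selector such as 0x05040100 reproduces the
# packed result of the and/shr/or sequence in one VALU instruction.
assert v_perm_b32(v1, v0, 0x05040100) == pack_via_mask_shift_or(v0, v1)
```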

Change-Id: Ifa4a74fdb81ef44f22ba490c7fdf81ec8aebc945

Diff Detail

Event Timeline

jrbyrnes created this revision. Sep 22 2022, 11:26 AM
Herald added a project: Restricted Project. · View Herald Transcript · Sep 22 2022, 11:26 AM
jrbyrnes requested review of this revision. Sep 22 2022, 11:26 AM
Herald added a project: Restricted Project. · View Herald Transcript · Sep 22 2022, 11:26 AM
jrbyrnes edited the summary of this revision. Sep 22 2022, 12:18 PM
jrbyrnes edited the summary of this revision.
arsenm added inline comments. Sep 22 2022, 12:21 PM
llvm/lib/Target/AMDGPU/SIInstructions.td
2796–2797

Should we just replace this?

2802

Also should cover the integer cases?

llvm/test/CodeGen/AMDGPU/pack.v2f16.ll
0–1

Switching to generated checks should be a separate pre-commit

Hey Matt, thanks for the comments. I'll address them soon; for now I'll add you as a reviewer.

rampitec added inline comments. Sep 22 2022, 12:54 PM
llvm/test/CodeGen/AMDGPU/pack.v2f16.ll
0–1

GCN is misleading here. Use something like GFX7_8

jrbyrnes updated this revision to Diff 462543. Sep 23 2022, 10:26 AM
jrbyrnes marked 2 inline comments as done.

Precommit generated tests pack.v2f16.ll, rebase

jrbyrnes updated this revision to Diff 462545. Sep 23 2022, 10:32 AM

Fix attributes in test

arsenm added inline comments. Sep 23 2022, 11:50 AM
llvm/lib/Target/AMDGPU/SIInstructions.td
2796–2797

This is looking dead

llvm/test/CodeGen/AMDGPU/v_perm_non_canon.ll
4

Probably should move this in with vector_shuffle.packed.ll

5

Should use addrspace(1)*

37

Separate functions for each tested shuffle.

Also needs versions using i16.

jrbyrnes updated this revision to Diff 462630. Sep 23 2022, 5:29 PM
jrbyrnes marked 2 inline comments as done.

Address review comments.

Added some patterns to account for the cases where we are trying to concat:
v0[0]:v1[1]
v0[1]:v1[0]

Removed some seemingly dead patterns after introducing those.

Pushing as is for potential feedback; still somewhat a WIP.

jrbyrnes marked 3 inline comments as done. Sep 23 2022, 5:32 PM
arsenm added inline comments. Sep 26 2022, 7:52 AM
llvm/lib/Target/AMDGPU/SIInstructions.td
2778–2787

Can use a class or foreach over the types to avoid repeating the same pattern twice

foad added a comment. Sep 27 2022, 2:01 AM

Can't you use v_alignbit for all the cases where you need the upper 16 bits of one register and the lower 16 bits of the other? It should be smaller than v_perm because the shift amount (16) is an inline constant.
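(For readers of the archive: a minimal Python sketch of the alignbit semantics referenced above, assuming D is the low 32 bits of the 64-bit concatenation {S0, S1} shifted right by S2; the register values are arbitrary.)

```
def v_alignbit_b32(s0: int, s1: int, s2: int) -> int:
    # Low 32 bits of the 64-bit value {s0, s1} shifted right by s2 & 31
    # (s0 supplies the high word of the concatenation).
    return (((s0 << 32) | s1) >> (s2 & 31)) & 0xFFFFFFFF

v0, v1 = 0x3C001234, 0x4000ABCD
# A shift amount of 16 (an inline constant) yields V[1].low : V[0].hi.
assert v_alignbit_b32(v1, v0, 16) == 0xABCD3C00
```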

jrbyrnes updated this revision to Diff 463311. Sep 27 2022, 12:26 PM
jrbyrnes marked an inline comment as done.

Precommit generated test + Rebase

Consolidate patterns into foreach

Lower to V_ALIGNBIT for D = V[1].low : V[0].hi

Can't you use v_alignbit for all the cases where you need the upper 16 bits of one register and the lower 16 bits of the other? It should be smaller than v_perm because the shift amount (16) is an inline constant.

Hey, thanks for the good suggestion! I think this will only work for the case where we want V[1].low : V[0].hi.

In the case where we want V[1].hi : V[0].low, we can't lower to V_ALIGNBIT_B32 $V0, $V1, 16 because that would incorrectly put the bits from $V0 as the MSBs in the dest. On the other hand, V_ALIGNBIT_B32 $V1, $V0, 16 correctly has the bits from $V1 as the MSBs, but they are its lower 16 (and the rest of the result is the higher 16 from $V0).
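(A quick check of that argument, reusing the same sketch of the alignbit semantics; the register values are arbitrary.)

```
def v_alignbit_b32(s0: int, s1: int, s2: int) -> int:
    # Low 32 bits of {s0, s1} >> (s2 & 31); s0 is the high word.
    return (((s0 << 32) | s1) >> (s2 & 31)) & 0xFFFFFFFF

v0, v1 = 0x3C001234, 0x4000ABCD
want = (v1 & 0xFFFF0000) | (v0 & 0xFFFF)          # V[1].hi : V[0].low
assert v_alignbit_b32(v0, v1, 16) == 0x12344000   # $V0's bits land in the MSBs
assert v_alignbit_b32(v1, v0, 16) == 0xABCD3C00   # $V1 on top, but its lower 16
assert want not in {v_alignbit_b32(v0, v1, 16), v_alignbit_b32(v1, v0, 16)}
```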

foad added a comment. Sep 28 2022, 1:26 AM

In the case where we want V[1].hi : V[0].low we can't lower to V_ALIGNBIT_B32 $V0, $V1, 16 because that would incorrectly put the bits from $V0 as the MSBs in the dest. On the other hand V_ALIGNBIT_B32 $V1, $V0, 16 correctly has the bits from $V1 as the MSBs, but they are the lower 16 (and the higher 16 from $V0).

Good point. You could use V_BFI_B32 but I guess that is no better or worse than V_PERM_B32.
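(For reference, a minimal Python sketch of the BFI semantics being compared here, assuming the usual definition D = (S0 & S1) | (~S0 & S2); the register values are arbitrary.)

```
def v_bfi_b32(s0: int, s1: int, s2: int) -> int:
    # Bitfield insert: bits selected by s0 come from s1, the rest from s2.
    return ((s0 & s1) | (~s0 & s2)) & 0xFFFFFFFF

v0, v1 = 0x3C001234, 0x4000ABCD
# Mask 0xffff: low 16 bits from v0, high 16 bits from v1 = V[1].hi : V[0].low.
assert v_bfi_b32(0xFFFF, v0, v1) == 0x40001234
```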

In the case where we want V[1].hi : V[0].low we can't lower to V_ALIGNBIT_B32 $V0, $V1, 16 because that would incorrectly put the bits from $V0 as the MSBs in the dest. On the other hand V_ALIGNBIT_B32 $V1, $V0, 16 correctly has the bits from $V1 as the MSBs, but they are the lower 16 (and the higher 16 from $V0).

Good point. You could use V_BFI_B32 but I guess that is no better or worse than V_PERM_B32.

One small point in favor of BFI is that the bitmask you need is more likely CSEable for unrelated uses

jrbyrnes updated this revision to Diff 463635. Sep 28 2022, 11:35 AM

Use V_BFI for V[1].hi : V[0].low. This allows for a bitmask that is more likely to be reused by other instructions (0xffff vs 0x7060100), potentially enabling other optimizations (e.g. CSE).

One small point in favor of BFI is that the bitmask you need is more likely CSEable for unrelated uses

Thanks, good point. Changed the pattern in favor of BFI.

jrbyrnes added inline comments. Sep 28 2022, 12:03 PM
llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.private.ll
240 ↗(On Diff #463635)

This seems illegal to me: using an SGPR and a literal as operands to a VALU instruction. Looking into it.

rampitec added inline comments. Sep 28 2022, 12:05 PM
llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.private.ll
240 ↗(On Diff #463635)

0 is an inline literal and is free.

foad added inline comments. Sep 29 2022, 1:09 AM
llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.private.ll
240 ↗(On Diff #463635)

As a code quality thing, this could have been optimized to v_and_b32 v1, 0xffff0000, v0

jrbyrnes marked 2 inline comments as done. Sep 29 2022, 11:11 AM
jrbyrnes added inline comments.
llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.private.ll
240 ↗(On Diff #463635)

Stas: I see, thanks!

Jay: Interesting, I'll look into what's going on with the literal. As a side note, the codegen is actually not good for this particular test; it seems to me the whole test could be combined into a single 32-bit load. D133584 should be extended to handle i16s, in which case this whole test will be optimized to a load.

jrbyrnes marked an inline comment as done. Sep 29 2022, 11:11 AM
jrbyrnes marked an inline comment as not done.
arsenm added inline comments. Sep 29 2022, 11:39 AM
llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.private.ll
240 ↗(On Diff #463635)

This could only be a 16-bit load if unaligned access is enabled (and I think we previously decided that doing unaligned 16-bit loads was probably worse than byte loads). The load question is orthogonal to how the bit masking should have been emitted

jrbyrnes updated this revision to Diff 464096. Sep 29 2022, 4:44 PM
jrbyrnes marked 2 inline comments as done.

Add a pattern to select V_AND v1, 0xffff0000 in the case where the buildvector produces bits V1.hi : 0

Add test coverage for pattern.
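(To illustrate why a plain AND suffices for this case: with the low half tied to zero, the masked select collapses to a single mask of the other source. A minimal Python sketch with arbitrary values.)

```
def v_bfi_b32(s0: int, s1: int, s2: int) -> int:
    # Bitfield insert: D = (s0 & s1) | (~s0 & s2), truncated to 32 bits.
    return ((s0 & s1) | (~s0 & s2)) & 0xFFFFFFFF

v1 = 0x4000ABCD
# Building V1.hi : 0 needs no select at all, only an AND with 0xffff0000
# (i.e. v_and_b32 v1, 0xffff0000, v0, as suggested in the review).
assert v_bfi_b32(0xFFFF, 0, v1) == v1 & 0xFFFF0000
```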

llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.private.ll
240 ↗(On Diff #463635)

Right: good to know about the decision to use byte loads. I agree it is a bit off topic for this review.

arsenm accepted this revision. Sep 30 2022, 9:02 AM

LGTM

This revision is now accepted and ready to land. Sep 30 2022, 9:02 AM
jrbyrnes updated this revision to Diff 464779. Oct 3 2022, 12:45 PM

Rebased to trunk and made the necessary test modifications. NFC