This is an archive of the discontinued LLVM Phabricator instance.

Differential D51947

[AMDGPU] Match udot8 pattern
ClosedPublic

Authored by FarhanaAleen on Sep 11 2018, 1:39 PM.

Details

Reviewers

rampitec
arsenm
nhaehnle

Commits

rGf5a2848376b4: [AMDGPU] Match udot8 pattern
rL342497: [AMDGPU] Match udot8 pattern

Summary

D.u32 = S0.u4[0] * S1.u4[0] +
        S0.u4[1] * S1.u4[1] +
        S0.u4[2] * S1.u4[2] +
        S0.u4[3] * S1.u4[3] +
        S0.u4[4] * S1.u4[4] +
        S0.u4[5] * S1.u4[5] +
        S0.u4[6] * S1.u4[6] +
        S0.u4[7] * S1.u4[7] +
        S2.u32

Negated form will be supported with idot8.
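
As a reading aid, the formula in the summary can be modelled in a few lines of Python. This is only a reference sketch: the function name `udot8` is chosen here for illustration, and the wrap-around at 32 bits is an assumption based on the pattern passing a zero clamp bit.

```python
def udot8(s0: int, s1: int, s2: int) -> int:
    """Reference model: eight unsigned 4-bit products plus a 32-bit accumulator."""
    acc = s2 & 0xFFFFFFFF
    for i in range(8):
        a = (s0 >> (4 * i)) & 0xF  # S0.u4[i]
        b = (s1 >> (4 * i)) & 0xF  # S1.u4[i]
        acc += a * b
    # Assumption: the result wraps modulo 2**32 (no clamping).
    return acc & 0xFFFFFFFF
```

For example, with every nibble of both sources set to 1 and a zero accumulator, the model returns 8.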

Diff Detail

Event Timeline

FarhanaAleen created this revision. Sep 11 2018, 1:39 PM

Herald added subscribers: t-tye, tpr, dstuttard and 5 others. Sep 11 2018, 1:39 PM

As for the testcases, what about vectorized multiplication, i.e.:

%vec1 = load <8 x i4>, ...
%vec2 = load <8 x i4>, ...
%ext1 = zext <8 x i4> %vec1 to <8 x i32>
%ext2 = zext <8 x i4> %vec2 to <8 x i32>
%mul = mul nuw nsw <8 x i32> %ext1, %ext2
... then extractelement and add up the result
... or possibly the same thing without the zext

The TableGen itself looks good to me, except for one nitpick (inline).

lib/Target/AMDGPU/VOP3PInstructions.td
206–210	I realize this was already done this way for the 8bit case, but it would be cleaner to use Index 0-7 instead of 1-8. 0-based indexing is more natural here, and would avoid the `!add(.., -1)`.
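
To illustrate the indexing point, the sketch below computes the bit offsets both ways. The helper names are hypothetical and exist only for this comparison; the 1-based form mirrors `!shl(!add(Index, -1), 3)` and the 0-based form mirrors `!shl(Index, 2)`.

```python
def offsets_one_based(width: int, count: int) -> list:
    # Old style: Index runs 1..count, so 1 must be subtracted before shifting.
    shift = width.bit_length() - 1  # log2(width) for power-of-two widths
    return [(index - 1) << shift for index in range(1, count + 1)]


def offsets_zero_based(width: int, count: int) -> list:
    # Suggested style: Index runs 0..count-1 and is shifted directly.
    shift = width.bit_length() - 1
    return [index << shift for index in range(count)]
```

Both helpers yield the same offsets for the 8-bit case, which is why the switch to 0-based indices only removes the extra subtraction.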

Thanks Nicolai.

Added testcases for the vector version of the multiplication.
Also updated the pattern to start with 0.

arsenm added inline comments. Sep 12 2018, 8:14 PM

test/CodeGen/AMDGPU/idot8.ll
298–307	This is a lot of hardcoded registers for a non-generated test. Either add the checks or use update_llc_test_checks?

FarhanaAleen added inline comments. Sep 13 2018, 12:25 PM

test/CodeGen/AMDGPU/idot8.ll
298–307	Hi Matt, this is generated by update_llc_test_checks with a little modification. I just used a common label for some of the common instructions and removed some instructions that are unrelated to this test, to keep it concise.

nhaehnle added inline comments. Sep 14 2018, 1:33 AM

test/CodeGen/AMDGPU/idot8.ll
298–307	But this means somebody will have to manually repeat that work when some arbitrary register allocation changes happen. Better to just take the raw output from update_llc_test_checks, since the common label/instructions don't actually seem to save that many lines.

Updated the test checks so that they are purely generated by update_llc_test_checks.

Thanks, this mostly looks good to me. Looks like this may be running into a serious limitation of the ISel infrastructure with commutativity / associativity, but it makes sense to land this patch without addressing it. I do have one last question.

test/CodeGen/AMDGPU/idot8.ll
415–425	Why isn't this testcase using the v_dot instruction? Could this be fixed relatively easily by extending the `MulU#Index#"_4bit"` PatFrag with a second pattern?

In D51947#1236628, @nhaehnle wrote:

Thanks, this mostly looks good to me. Looks like this may be running into a serious limitation of the ISel infrastructure with commutativity / associativity, but it makes sense to land this patch without addressing it. I do have one last question.

I have been thinking about different ways to handle it. The easiest solution would be to put a threshold on the number of permutations. Thanks, and yes, I would like to go ahead with this patch.
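
A back-of-the-envelope count shows why permutation matters for a pattern of this size. The sketch assumes that every commutative node can swap its operands independently and counts only the add and multiply nodes. The real TableGen variant generator deduplicates and handles other nodes too, so actual numbers will differ. The function name is hypothetical.

```python
def commuted_variants(commutative_nodes: int) -> int:
    # Each commutative binary node contributes a factor of two (operands swapped or not).
    return 2 ** commutative_nodes


# dot8 chain written with ordinary add/mul: 8 adds and 8 multiplies.
all_commutative = commuted_variants(8 + 8)

# With NonACAdd and NonACAMDGPUmul_u24, only the first add_oneuse stays commutative.
with_non_ac_nodes = commuted_variants(1)
```

Under these assumptions the count drops from 65536 to 2, which is consistent with the compile-time concern noted in the patch comments.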

test/CodeGen/AMDGPU/idot8.ll
415–425	Yes, it can be fixed by writing a second pattern, but that would be a workaround. The issue here is the additional 'and' instruction, which should be removed (it is not generated for earlier generations such as GFX7). Otherwise it will keep causing problems elsewhere, and we might end up adding more patterns or unnecessary temporary code to work around it in all those places. I will look into it as a separate action.

Okay, thanks for the explanation, that seems fair.
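
The claim that the extra 'and' is unnecessary can be checked with modular arithmetic: masking an intermediate sum to the accumulator width does not change the final truncated result. The sketch below models the udot8_acc16 case, where the GCN-DL checks show a `v_and_b32_e32 v2, 0xffff, v2` after the second MAD. The function and its `mask_after` parameter are illustrative names, not part of the patch.

```python
def chain_sum(products, acc, bits, mask_after=None):
    """Accumulate products onto acc, optionally masking once mid-chain."""
    mask = (1 << bits) - 1
    total = acc & mask
    for i, p in enumerate(products):
        total += p
        if mask_after == i:
            total &= mask  # models the intermediate 'and'
    return total & mask  # the narrow store truncates anyway


products = [15 * 15] * 8  # largest possible 4-bit products
plain = chain_sum(products, 0xFFF0, 16)
masked = chain_sum(products, 0xFFF0, 16, mask_after=1)
```

Both variants give the same 16-bit result, so removing the intermediate mask would not change the stored value.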

This revision is now accepted and ready to land. Sep 18 2018, 3:55 AM

Closed by commit rL342497: [AMDGPU] Match udot8 pattern (authored by faaleen). Sep 18 2018, 10:02 AM

This revision was automatically updated to reflect the committed changes.

Herald added a subscriber: llvm-commits. Sep 18 2018, 10:02 AM

Revision Contents

Path	Size
lib/Target/AMDGPU/VOP3PInstructions.td	69 lines
test/CodeGen/AMDGPU/idot8.ll	1204 lines

Diff 165175

lib/Target/AMDGPU/VOP3PInstructions.td

 }

 defm : MadFmaMixPats<fma, V_FMA_MIX_F32, V_FMA_MIXLO_F16, V_FMA_MIXHI_F16>;
 }

 class Srl<int N> : PatFrag<(ops node:$src),
 (srl node:$src, (i32 N))>;

-foreach Bits = [8, 16, 24] in {
-def srl#Bits : Srl<Bits>;
-}
-
-def and_255 : PatFrag<
-(ops node:$src0), (and node:$src0, (i32 255))
->;
+foreach Bits = 1-7 in
+def srl#!shl(Bits, 2) : Srl<!shl(Bits, 2)>;

-class Extract_U8<int FromBitIndex> : PatFrag<(
-ops node:$src),
-!if (!eq (FromBitIndex, 24), // last element
+class Extract_U<int FromBitIndex, int BitMask> : PatFrag<
+(ops node:$src),
+!if (!or (!and (!eq (BitMask, 255), !eq (FromBitIndex, 24)),
+!and (!eq (BitMask, 15), !eq (FromBitIndex, 28))), // last element
 (!cast<Srl>("srl"#FromBitIndex) node:$src),
 !if (!eq (FromBitIndex, 0), // first element
-(and_255 node:$src),
-(and_255 (!cast<Srl>("srl"#FromBitIndex) node:$src))))>;
+(and node:$src, (i32 BitMask)),
+(and (!cast<Srl>("srl"#FromBitIndex) node:$src), (i32 BitMask))))>;

-// Defines patterns that extract each Index'ed 8bit from a 32bit scalar value;
-foreach Index = [1, 2, 3, 4] in {
-def UElt#Index : Extract_U8<!shl(!add(Index, -1), 3)>;
-}
+foreach Index = 0-3 in {
+// Defines patterns that extract each Index'ed 8bit from an unsigned
+// 32bit scalar value;
+def U#Index#"_8bit" : Extract_U<!shl(Index, 3),
+255>;

 // Defines multiplication patterns where the multiplication is happening on each
 // Index'ed 8bit of a 32bit scalar value.
-foreach Index = [1, 2, 3, 4] in {
 def MulU_Elt#Index : PatFrag<
 (ops node:$src0, node:$src1),
-(AMDGPUmul_u24_oneuse (!cast<Extract_U8>("UElt"#Index) node:$src0),
-(!cast<Extract_U8>("UElt"#Index) node:$src1))>;
+(AMDGPUmul_u24_oneuse (!cast<Extract_U>("U"#Index#"_8bit") node:$src0),
+(!cast<Extract_U>("U"#Index#"_8bit") node:$src1))>;
+}
+
+// Different variants of dot8 patterns cause a huge increase in the compile time.
+// Define non-associative/commutative add/mul to prevent permutation in the dot8
+// pattern.
+def NonACAdd : SDNode<"ISD::ADD" , SDTIntBinOp>;
+def NonACAdd_oneuse : HasOneUseBinOp<NonACAdd>;
+
+def NonACAMDGPUmul_u24 : SDNode<"AMDGPUISD::MUL_U24" , SDTIntBinOp>;
+def NonACAMDGPUmul_u24_oneuse : HasOneUseBinOp<NonACAMDGPUmul_u24>;
+
+foreach Index = 0-7 in {
+// Defines patterns that extract each Index'ed 4bit from an unsigned
+// 32bit scalar value;
+def U#Index#"_4bit" : Extract_U<!shl(Index, 2),
+15>;
+
+// Defines multiplication patterns where the multiplication is happening on each
+// Index'ed 8bit of a 32bit scalar value.
+def MulU#Index#"_4bit" : PatFrag<
+(ops node:$src0, node:$src1),
+(NonACAMDGPUmul_u24_oneuse (!cast<Extract_U>("U"#Index#"_4bit") node:$src0),
+(!cast<Extract_U>("U"#Index#"_4bit") node:$src1))>;
 }

 class UDot2Pat<Instruction Inst> : GCNPat <
 (add (add_oneuse (AMDGPUmul_u24_oneuse (srl i32:$src0, (i32 16)),
 (srl i32:$src1, (i32 16))), i32:$src2),
 (AMDGPUmul_u24_oneuse (and i32:$src0, (i32 65535)),
 (and i32:$src1, (i32 65535)))
 ),
[34 lines not shown]
 defm : DotPats<int_amdgcn_udot4, V_DOT4_U32_U8>;
 defm : DotPats<int_amdgcn_sdot8, V_DOT8_I32_I4>;
 defm : DotPats<int_amdgcn_udot8, V_DOT8_U32_U4>;

 def : UDot2Pat<V_DOT2_U32_U16>;
 def : SDot2Pat<V_DOT2_I32_I16>;

 def : GCNPat <
-!cast<dag>(!foldl((i32 i32:$src2), [1, 2, 3, 4], lhs, y,
+!cast<dag>(!foldl((i32 i32:$src2), [0, 1, 2, 3], lhs, y,
 (add_oneuse lhs, (!cast<PatFrag>("MulU_Elt"#y) i32:$src0, i32:$src1)))),
 (V_DOT4_U32_U8 (i32 8), $src0, (i32 8), $src1, (i32 8), $src2, (i1 0))
 >;

+def : GCNPat <
+!cast<dag>(!foldl((add_oneuse i32:$src2, (MulU0_4bit i32:$src0, i32:$src1)), [1, 2, 3, 4, 5, 6, 7], lhs, y,
+(NonACAdd_oneuse lhs, (!cast<PatFrag>("MulU"#y#"_4bit") i32:$src0, i32:$src1)))),
+(V_DOT8_U32_U4 (i32 8), $src0, (i32 8), $src1, (i32 8), $src2, (i1 0))
+>;

 } // End SubtargetPredicate = HasDLInsts

 multiclass VOP3P_Real_vi<bits<10> op> {
 def _vi : VOP3P_Real<!cast<VOP3_Pseudo>(NAME), SIEncodingFamily.VI>,
 VOP3Pe <op, !cast<VOP3_Pseudo>(NAME).Pfl> {
 let AssemblerPredicates = [HasVOP3PInsts];
 let DecoderNamespace = "VI";
 }
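
The `Extract_U` class in this diff selects one of three forms depending on where the field sits in the 32-bit value. The Python sketch below mirrors that choice so the equivalence with a plain shift-and-mask is easy to see. The function name is illustrative, and the "last element" test is generalized here to `from_bit + width == 32`, which covers the two cases the patch spells out (mask 255 at bit 24, mask 15 at bit 28).

```python
def extract_u(src: int, from_bit: int, bit_mask: int) -> int:
    src &= 0xFFFFFFFF
    width = bit_mask.bit_length()
    if from_bit + width == 32:
        # Last element: the logical shift already clears the upper bits.
        return src >> from_bit
    if from_bit == 0:
        # First element: no shift is needed, only the mask.
        return src & bit_mask
    # Middle elements: shift, then mask.
    return (src >> from_bit) & bit_mask


nibbles = [extract_u(0x89ABCDEF, 4 * i, 15) for i in range(8)]
octets = [extract_u(0x89ABCDEF, 8 * i, 255) for i in range(4)]
```

Every case agrees with `(src >> from_bit) & bit_mask`; the special cases only drop the operation that would be a no-op.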

test/CodeGen/AMDGPU/idot8.ll

This file was added.
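
The GFX7/8/9 checks in this test match `s_bfe_u32` with immediates such as `0x40018`. As I understand the GCN ISA documentation, the second operand packs the field offset into bits [4:0] and the field width into bits [22:16]. The sketch below decodes such an immediate and models the extraction under that assumption; both function names are chosen here for illustration.

```python
def decode_bfe_operand(operand: int):
    offset = operand & 0x1F         # bits [4:0]: field offset (assumed layout)
    width = (operand >> 16) & 0x7F  # bits [22:16]: field width (assumed layout)
    return offset, width


def s_bfe_u32(value: int, operand: int) -> int:
    # Unsigned bit-field extract of `width` bits starting at `offset`.
    offset, width = decode_bfe_operand(operand)
    return ((value & 0xFFFFFFFF) >> offset) & ((1 << width) - 1)
```

Read this way, `0x40018` extracts the 4-bit field at bit 24, and the constants `0x40004` through `0x40018` step through elements 1 to 6 of the `<8 x i4>` vector. Element 0 uses `s_and_b32 ..., 15` and element 7 uses `s_lshr_b32 ..., 28`.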

				; RUN: llc -mtriple=amdgcn -mcpu=gfx700 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,GFX789,GFX7,GFX78 %s
				; RUN: llc -mtriple=amdgcn -mcpu=gfx803 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,GFX789,GFX8,GFX78 %s
				; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,GFX789,GFX9 %s
				; RUN: llc -mtriple=amdgcn -mcpu=gfx906 -verify-machineinstrs < %s | FileCheck -check-prefixes=GCN,GCN-DL,GFX9 %s

				define amdgpu_kernel void @udot8_acc32(<8 x i4> addrspace(1)* %src1,
				; GCN-LABEL: udot8_acc32:
				; GCN: ; %bb.0: ; %entry
				; GCN-NEXT: s_load_dwordx4 s{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}
				; GCN-NEXT: s_load_dwordx2 s{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}

				; GFX789: s_load_dword s{{[0-9]+}}, s{{\[[0-9]+:[0-9]+\]}}, 0x0
				; GFX789-NEXT: s_load_dword s{{[0-9]+}}, s{{\[[0-9]+:[0-9]+\]}}, 0x0
				; GFX789-NEXT: s_load_dword s{{[0-9]+}}, s{{\[[0-9]+:[0-9]+\]}}, 0x0
				; GFX789: s_waitcnt lgkmcnt(0)
				; GFX789-NEXT: s_lshr_b32 s{{[0-9]+}}, s{{[0-9]+}}, 28
				; GFX789-NEXT: s_lshr_b32 s{{[0-9]+}}, s{{[0-9]+}}, 28
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40018
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40014
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40010
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x4000c
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40008
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40004
				; GFX789-NEXT: s_and_b32 s{{[0-9]+}}, s{{[0-9]+}}, 15
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40018
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40014
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40010
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x4000c
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40008
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40004
				; GFX789-NEXT: s_and_b32 s{{[0-9]+}}, s{{[0-9]+}}, 15
				; GFX789-NEXT: v_mov_b32_e32 v{{[0-9]+}}
				; GFX789-NEXT: v_mov_b32_e32 [[SRC2:v[0-9]+]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD1:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, [[SRC2]]
				; GFX789-NEXT: v_mov_b32_e32 [[V2E2:v[0-9]+]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD2:v[0-9]+]], s{{[0-9]+}}, [[V2E2]], [[MAD1]]
				; GFX789-NEXT: v_mov_b32_e32 [[V2E3:v[0-9]+]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD3:v[0-9]+]], s{{[0-9]+}}, [[V2E3]], [[MAD2]]
				; GFX789-NEXT: v_mov_b32_e32 [[V2E4:v[0-9]+]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD4:v[0-9]+]], s{{[0-9]+}}, [[V2E4]], [[MAD3]]
				; GFX789-NEXT: v_mov_b32_e32 [[V2E5:v[0-9]+]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD5:v[0-9]+]], s{{[0-9]+}}, [[V2E5]], [[MAD4]]
				; GFX789-NEXT: v_mov_b32_e32 [[V2E6:v[0-9]+]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD6:v[0-9]+]], s{{[0-9]+}}, [[V2E6]], [[MAD5]]
				; GFX789-NEXT: v_mov_b32_e32 [[V2E7:v[0-9]+]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD7:v[0-9]+]], s{{[0-9]+}}, [[V2E7]], [[MAD6]]
				; GFX789-NEXT: v_mov_b32_e32 [[V2E8:v[0-9]+]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD8:v[0-9]+]], s{{[0-9]+}}, [[V2E8]], [[MAD7]]
				; GFX789-NEXT: {{buffer|flat|global}}_store_dword
				; GFX789-NEXT: s_endpgm

				; GCN-DL: s_waitcnt lgkmcnt(0)
				; GCN-DL-NEXT: s_load_dword s2, s[4:5], 0x0
				; GCN-DL-NEXT: s_load_dword s4, s[6:7], 0x0
				; GCN-DL-NEXT: s_load_dword s5, s[0:1], 0x0
				; GCN-DL-NEXT: v_mov_b32_e32 v0, s0
				; GCN-DL-NEXT: v_mov_b32_e32 v1, s1
				; GCN-DL-NEXT: s_waitcnt lgkmcnt(0)
				; GCN-DL-NEXT: v_mov_b32_e32 v2, s4
				; GCN-DL-NEXT: v_mov_b32_e32 v3, s5
				; GCN-DL-NEXT: v_dot8_u32_u4 v2, s2, v2, v3
				; GCN-DL-NEXT: global_store_dword v[0:1], v2, off
				; GCN-DL-NEXT: s_endpgm

				<8 x i4> addrspace(1)* %src2,
				i32 addrspace(1)* nocapture %dst) {
				entry:
				%vec1 = load <8 x i4>, <8 x i4> addrspace(1)* %src1
				%vec2 = load <8 x i4>, <8 x i4> addrspace(1)* %src2

				%v1e0 = extractelement <8 x i4> %vec1, i64 0
				%cv1e0 = zext i4 %v1e0 to i32
				%v2e0 = extractelement <8 x i4> %vec2, i64 0
				%cv2e0 = zext i4 %v2e0 to i32
				%mul0 = mul nuw nsw i32 %cv1e0, %cv2e0

				%v1e1 = extractelement <8 x i4> %vec1, i64 1
				%cv1e1 = zext i4 %v1e1 to i32
				%v2e1 = extractelement <8 x i4> %vec2, i64 1
				%cv2e1 = zext i4 %v2e1 to i32
				%mul1 = mul nuw nsw i32 %cv1e1, %cv2e1

				%v1e2 = extractelement <8 x i4> %vec1, i64 2
				%cv1e2 = zext i4 %v1e2 to i32
				%v2e2 = extractelement <8 x i4> %vec2, i64 2
				%cv2e2 = zext i4 %v2e2 to i32
				%mul2 = mul nuw nsw i32 %cv1e2, %cv2e2

				%v1e3 = extractelement <8 x i4> %vec1, i64 3
				%cv1e3 = zext i4 %v1e3 to i32
				%v2e3 = extractelement <8 x i4> %vec2, i64 3
				%cv2e3 = zext i4 %v2e3 to i32
				%mul3 = mul nuw nsw i32 %cv1e3, %cv2e3

				%v1e4 = extractelement <8 x i4> %vec1, i64 4
				%cv1e4 = zext i4 %v1e4 to i32
				%v2e4 = extractelement <8 x i4> %vec2, i64 4
				%cv2e4 = zext i4 %v2e4 to i32
				%mul4 = mul nuw nsw i32 %cv1e4, %cv2e4

				%v1e5 = extractelement <8 x i4> %vec1, i64 5
				%cv1e5 = zext i4 %v1e5 to i32
				%v2e5 = extractelement <8 x i4> %vec2, i64 5
				%cv2e5 = zext i4 %v2e5 to i32
				%mul5 = mul nuw nsw i32 %cv1e5, %cv2e5

				%v1e6 = extractelement <8 x i4> %vec1, i64 6
				%cv1e6 = zext i4 %v1e6 to i32
				%v2e6 = extractelement <8 x i4> %vec2, i64 6
				%cv2e6 = zext i4 %v2e6 to i32
				%mul6 = mul nuw nsw i32 %cv1e6, %cv2e6

				%v1e7 = extractelement <8 x i4> %vec1, i64 7
				%cv1e7 = zext i4 %v1e7 to i32
				%v2e7 = extractelement <8 x i4> %vec2, i64 7
				%cv2e7 = zext i4 %v2e7 to i32
				%mul7 = mul nuw nsw i32 %cv1e7, %cv2e7

				%acc = load i32, i32 addrspace(1)* %dst, align 4
				%add1 = add i32 %mul0, %acc
				%add2 = add i32 %add1, %mul1
				%add3 = add i32 %add2, %mul2
				%add4 = add i32 %add3, %mul3
				%add5 = add i32 %add4, %mul4
				%add6 = add i32 %add5, %mul5
				%add7 = add i32 %add6, %mul6
				%add8 = add i32 %add7, %mul7

				store i32 %add8, i32 addrspace(1)* %dst, align 4
				ret void
				}

				; TODO: Remove the unnecessary instruction(that is zero-extending the
				; 2nd MAD) to have the pattern-recognizer to kick in.
				define amdgpu_kernel void @udot8_acc16(<8 x i4> addrspace(1)* %src1,
				; GCN-LABEL: udot8_acc16:
				; GCN: ; %bb.0: ; %entry
				; GCN-NEXT: s_load_dwordx4 s{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}
				; GCN-NEXT: s_load_dwordx2 s{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}
				; GFX789: s_load_dword
				; GFX789: {{buffer|flat|global}}_load_ushort
				; GFX789: s_waitcnt lgkmcnt(0)
				; GFX789: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40004
				; GFX789: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40014
				; GFX789: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40010
				; GFX789: s_waitcnt vmcnt(0)
				; GFX789-NEXT: v_mad_u32_u24 [[MAD1:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
				; GFX789-NEXT: v_mad_u32_u24 [[MAD2:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, [[MAD1]]
				; GFX789: v_mad_u32_u24 [[MAD3:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, [[MAD2]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD4:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, [[MAD3]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD5:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, [[MAD4]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD6:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, [[MAD5]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD7:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, [[MAD6]]
				; GFX789-NEXT: v_mov_b32_e32 [[V2E8:v[0-9]+]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD8:v[0-9]+]], s{{[0-9]+}}, [[V2E8]], [[MAD7]]
				; GFX789-NEXT: {{buffer|flat|global}}_store_short
				; GFX789-NEXT: s_endpgm

				; GCN-DL: s_waitcnt lgkmcnt(0)
				; GCN-DL-NEXT: s_load_dword s2, s[4:5], 0x0
				; GCN-DL-NEXT: s_load_dword s4, s[6:7], 0x0
				; GCN-DL-NEXT: v_mov_b32_e32 v0, s0
				; GCN-DL-NEXT: v_mov_b32_e32 v1, s1
				; GCN-DL-NEXT: global_load_ushort v2, v[0:1], off
				; GCN-DL-NEXT: s_waitcnt lgkmcnt(0)
				; GCN-DL-NEXT: s_and_b32 s0, s2, 15
				; GCN-DL-NEXT: s_and_b32 s1, s4, 15
				; GCN-DL-NEXT: v_mov_b32_e32 v3, s1
				; GCN-DL-NEXT: s_bfe_u32 s5, s4, 0x40004
				; GCN-DL-NEXT: v_mov_b32_e32 v4, s5
				; GCN-DL-NEXT: s_bfe_u32 s1, s2, 0x40004
				; GCN-DL-NEXT: s_bfe_u32 s5, s4, 0x40008
				; GCN-DL-NEXT: s_bfe_u32 s8, s4, 0x40010
				; GCN-DL-NEXT: s_bfe_u32 s10, s4, 0x40014
				; GCN-DL-NEXT: s_bfe_u32 s12, s4, 0x40018
				; GCN-DL-NEXT: s_lshr_b32 s14, s4, 28
				; GCN-DL-NEXT: s_bfe_u32 s4, s4, 0x4000c
				; GCN-DL-NEXT: s_bfe_u32 s6, s2, 0x40008
				; GCN-DL-NEXT: v_mov_b32_e32 v5, s5
				; GCN-DL-NEXT: s_bfe_u32 s7, s2, 0x4000c
				; GCN-DL-NEXT: v_mov_b32_e32 v6, s4
				; GCN-DL-NEXT: s_bfe_u32 s9, s2, 0x40010
				; GCN-DL-NEXT: v_mov_b32_e32 v7, s8
				; GCN-DL-NEXT: s_bfe_u32 s11, s2, 0x40014
				; GCN-DL-NEXT: v_mov_b32_e32 v8, s10
				; GCN-DL-NEXT: s_bfe_u32 s13, s2, 0x40018
				; GCN-DL-NEXT: v_mov_b32_e32 v9, s12
				; GCN-DL-NEXT: s_lshr_b32 s2, s2, 28
				; GCN-DL-NEXT: s_waitcnt vmcnt(0)
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s0, v3, v2
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s1, v4, v2
				; GCN-DL-NEXT: v_and_b32_e32 v2, 0xffff, v2
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s6, v5, v2
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s7, v6, v2
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s9, v7, v2
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s11, v8, v2
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s13, v9, v2
				; GCN-DL-NEXT: v_mov_b32_e32 v3, s14
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s2, v3, v2
				; GCN-DL-NEXT: global_store_short v[0:1], v2, off
				; GCN-DL-NEXT: s_endpgm
				<8 x i4> addrspace(1)* %src2,
				i16 addrspace(1)* nocapture %dst) {
				entry:
				%vec1 = load <8 x i4>, <8 x i4> addrspace(1)* %src1
				%vec2 = load <8 x i4>, <8 x i4> addrspace(1)* %src2

				%v1e0 = extractelement <8 x i4> %vec1, i64 0
				%cv1e0 = zext i4 %v1e0 to i16
				%v2e0 = extractelement <8 x i4> %vec2, i64 0
				%cv2e0 = zext i4 %v2e0 to i16
				%mul0 = mul nuw nsw i16 %cv1e0, %cv2e0

				%v1e1 = extractelement <8 x i4> %vec1, i64 1
				%cv1e1 = zext i4 %v1e1 to i16
				%v2e1 = extractelement <8 x i4> %vec2, i64 1
				%cv2e1 = zext i4 %v2e1 to i16
				%mul1 = mul nuw nsw i16 %cv1e1, %cv2e1

				%v1e2 = extractelement <8 x i4> %vec1, i64 2
				%cv1e2 = zext i4 %v1e2 to i16
				%v2e2 = extractelement <8 x i4> %vec2, i64 2
				%cv2e2 = zext i4 %v2e2 to i16
				%mul2 = mul nuw nsw i16 %cv1e2, %cv2e2

				%v1e3 = extractelement <8 x i4> %vec1, i64 3
				%cv1e3 = zext i4 %v1e3 to i16
				%v2e3 = extractelement <8 x i4> %vec2, i64 3
				%cv2e3 = zext i4 %v2e3 to i16
				%mul3 = mul nuw nsw i16 %cv1e3, %cv2e3

				%v1e4 = extractelement <8 x i4> %vec1, i64 4
				%cv1e4 = zext i4 %v1e4 to i16
				%v2e4 = extractelement <8 x i4> %vec2, i64 4
				%cv2e4 = zext i4 %v2e4 to i16
				%mul4 = mul nuw nsw i16 %cv1e4, %cv2e4

				%v1e5 = extractelement <8 x i4> %vec1, i64 5
				%cv1e5 = zext i4 %v1e5 to i16
				%v2e5 = extractelement <8 x i4> %vec2, i64 5
				%cv2e5 = zext i4 %v2e5 to i16
				%mul5 = mul nuw nsw i16 %cv1e5, %cv2e5

				%v1e6 = extractelement <8 x i4> %vec1, i64 6
				%cv1e6 = zext i4 %v1e6 to i16
				%v2e6 = extractelement <8 x i4> %vec2, i64 6
				%cv2e6 = zext i4 %v2e6 to i16
				%mul6 = mul nuw nsw i16 %cv1e6, %cv2e6

				%v1e7 = extractelement <8 x i4> %vec1, i64 7
				%cv1e7 = zext i4 %v1e7 to i16
				%v2e7 = extractelement <8 x i4> %vec2, i64 7
				%cv2e7 = zext i4 %v2e7 to i16
				%mul7 = mul nuw nsw i16 %cv1e7, %cv2e7

				%acc = load i16, i16 addrspace(1)* %dst, align 4
				%add1 = add i16 %mul0, %acc
				%add2 = add i16 %add1, %mul1
				%add3 = add i16 %add2, %mul2
				%add4 = add i16 %add3, %mul3
				%add5 = add i16 %add4, %mul4
				%add6 = add i16 %add5, %mul5
				%add7 = add i16 %add6, %mul6
				%add8 = add i16 %add7, %mul7

				store i16 %add8, i16 addrspace(1)* %dst, align 4
				ret void
				}

				; TODO: Remove the unnecessary instruction(that is zero-extending the
				; 2nd MAD) to have the pattern-recognizer to kick in.
				define amdgpu_kernel void @udot8_acc8(<8 x i4> addrspace(1)* %src1,
				; GCN-LABEL: udot8_acc8:
				; GCN: ; %bb.0: ; %entry
				; GCN-NEXT: s_load_dwordx4 s{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}
				; GCN-NEXT: s_load_dwordx2 s{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}
				; GFX789: s_waitcnt lgkmcnt(0)
				; GFX789: s_load_dword
				; GFX789: s_load_dword
				; GFX789: s_waitcnt lgkmcnt(0)
				; GFX789: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40014
				; GFX789: s_lshr_b32 s{{[0-9]+}}, s{{[0-9]+}}, 28
				; GFX789: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x4000c
				; GFX789: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40008
				; GFX789: s_waitcnt vmcnt(0)
				; GFX789-NEXT: v_mad_u32_u24 [[MAD1:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
				; GFX789-NEXT: v_mad_u32_u24 [[MAD2:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, [[MAD1]]
				; GFX789: v_mad_u32_u24 [[MAD3:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, [[MAD2]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD4:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, [[MAD3]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD5:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, [[MAD4]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD6:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, [[MAD5]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD7:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, [[MAD6]]
				; GFX789-NEXT: v_mov_b32_e32 [[V2E8:v[0-9]+]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD8:v[0-9]+]], s{{[0-9]+}}, [[V2E8]], [[MAD7]]
				; GFX789-NEXT: {{buffer|flat|global}}_store_byte
				; GFX789-NEXT: s_endpgm

				; GCN-DL: s_waitcnt lgkmcnt(0)
				; GCN-DL-NEXT: s_load_dword s2, s[4:5], 0x0
				; GCN-DL-NEXT: s_load_dword s4, s[6:7], 0x0
				; GCN-DL-NEXT: v_mov_b32_e32 v0, s0
				; GCN-DL-NEXT: v_mov_b32_e32 v1, s1
				; GCN-DL-NEXT: global_load_ubyte v2, v[0:1], off
				; GCN-DL-NEXT: s_waitcnt lgkmcnt(0)
				; GCN-DL-NEXT: s_and_b32 s0, s2, 15
				; GCN-DL-NEXT: s_and_b32 s1, s4, 15
				; GCN-DL-NEXT: v_mov_b32_e32 v3, s1
				; GCN-DL-NEXT: s_bfe_u32 s5, s4, 0x40004
				; GCN-DL-NEXT: v_mov_b32_e32 v4, s5
				; GCN-DL-NEXT: s_bfe_u32 s1, s2, 0x40004
				; GCN-DL-NEXT: s_bfe_u32 s5, s4, 0x40008
				; GCN-DL-NEXT: s_bfe_u32 s8, s4, 0x40010
				; GCN-DL-NEXT: s_bfe_u32 s10, s4, 0x40014
				; GCN-DL-NEXT: s_bfe_u32 s12, s4, 0x40018
				; GCN-DL-NEXT: s_lshr_b32 s14, s4, 28
				; GCN-DL-NEXT: s_bfe_u32 s4, s4, 0x4000c
				; GCN-DL-NEXT: s_bfe_u32 s6, s2, 0x40008
				; GCN-DL-NEXT: v_mov_b32_e32 v5, s5
				; GCN-DL-NEXT: s_bfe_u32 s7, s2, 0x4000c
				; GCN-DL-NEXT: v_mov_b32_e32 v6, s4
				; GCN-DL-NEXT: s_bfe_u32 s9, s2, 0x40010
				; GCN-DL-NEXT: v_mov_b32_e32 v7, s8
				; GCN-DL-NEXT: s_bfe_u32 s11, s2, 0x40014
				; GCN-DL-NEXT: v_mov_b32_e32 v8, s10
				; GCN-DL-NEXT: s_bfe_u32 s13, s2, 0x40018
				; GCN-DL-NEXT: v_mov_b32_e32 v9, s12
				; GCN-DL-NEXT: s_lshr_b32 s2, s2, 28
				; GCN-DL-NEXT: s_waitcnt vmcnt(0)
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s0, v3, v2
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s1, v4, v2
				; GCN-DL-NEXT: v_and_b32_e32 v2, 0xff, v2
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s6, v5, v2
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s7, v6, v2
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s9, v7, v2
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s11, v8, v2
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s13, v9, v2
				; GCN-DL-NEXT: v_mov_b32_e32 v3, s14
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s2, v3, v2
				; GCN-DL-NEXT: global_store_byte v[0:1], v2, off
				; GCN-DL-NEXT: s_endpgm

				<8 x i4> addrspace(1)* %src2,
				i8 addrspace(1)* nocapture %dst) {
				entry:
				%vec1 = load <8 x i4>, <8 x i4> addrspace(1)* %src1
				%vec2 = load <8 x i4>, <8 x i4> addrspace(1)* %src2

				%v1e0 = extractelement <8 x i4> %vec1, i64 0
				%cv1e0 = zext i4 %v1e0 to i8
				%v2e0 = extractelement <8 x i4> %vec2, i64 0
				%cv2e0 = zext i4 %v2e0 to i8
				%mul0 = mul nuw nsw i8 %cv1e0, %cv2e0

				%v1e1 = extractelement <8 x i4> %vec1, i64 1
				%cv1e1 = zext i4 %v1e1 to i8
				%v2e1 = extractelement <8 x i4> %vec2, i64 1
				%cv2e1 = zext i4 %v2e1 to i8
				%mul1 = mul nuw nsw i8 %cv1e1, %cv2e1

				%v1e2 = extractelement <8 x i4> %vec1, i64 2
				%cv1e2 = zext i4 %v1e2 to i8
				%v2e2 = extractelement <8 x i4> %vec2, i64 2
				%cv2e2 = zext i4 %v2e2 to i8
				%mul2 = mul nuw nsw i8 %cv1e2, %cv2e2

				%v1e3 = extractelement <8 x i4> %vec1, i64 3
				%cv1e3 = zext i4 %v1e3 to i8
				%v2e3 = extractelement <8 x i4> %vec2, i64 3
				%cv2e3 = zext i4 %v2e3 to i8
				%mul3 = mul nuw nsw i8 %cv1e3, %cv2e3

				%v1e4 = extractelement <8 x i4> %vec1, i64 4
				%cv1e4 = zext i4 %v1e4 to i8
				%v2e4 = extractelement <8 x i4> %vec2, i64 4
				%cv2e4 = zext i4 %v2e4 to i8
				%mul4 = mul nuw nsw i8 %cv1e4, %cv2e4

				%v1e5 = extractelement <8 x i4> %vec1, i64 5
				%cv1e5 = zext i4 %v1e5 to i8
				%v2e5 = extractelement <8 x i4> %vec2, i64 5
				%cv2e5 = zext i4 %v2e5 to i8
				%mul5 = mul nuw nsw i8 %cv1e5, %cv2e5

				%v1e6 = extractelement <8 x i4> %vec1, i64 6
				%cv1e6 = zext i4 %v1e6 to i8
				%v2e6 = extractelement <8 x i4> %vec2, i64 6
				%cv2e6 = zext i4 %v2e6 to i8
				%mul6 = mul nuw nsw i8 %cv1e6, %cv2e6

				%v1e7 = extractelement <8 x i4> %vec1, i64 7
				%cv1e7 = zext i4 %v1e7 to i8
				%v2e7 = extractelement <8 x i4> %vec2, i64 7
				%cv2e7 = zext i4 %v2e7 to i8
				%mul7 = mul nuw nsw i8 %cv1e7, %cv2e7

				%acc = load i8, i8 addrspace(1)* %dst, align 4
				%add1 = add i8 %mul0, %acc
				%add2 = add i8 %add1, %mul1
				%add3 = add i8 %add2, %mul2
				%add4 = add i8 %add3, %mul3
				%add5 = add i8 %add4, %mul4
				%add6 = add i8 %add5, %mul5
				%add7 = add i8 %add6, %mul6
				%add8 = add i8 %add7, %mul7

				store i8 %add8, i8 addrspace(1)* %dst, align 4
				ret void
				}

				; TODO: Remove the two unnecessary instructions(and+add after 2nd MAD)
				; to have the pattern-recognizer to kick in.
				define amdgpu_kernel void @udot8_acc4(<8 x i4> addrspace(1)* %src1,
				; GCN-LABEL: udot8_acc4:
				; GCN: ; %bb.0: ; %entry
				; GCN: s_load_dwordx4 s{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}
				; GCN-NEXT: s_load_dwordx2 s{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}

				; GFX789: s_waitcnt lgkmcnt(0)
				; GFX789-NEXT: s_load_dword
				; GFX789: s_load_dword
				; GFX789: s_waitcnt lgkmcnt(0)
				; GFX789: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40010
				; GFX789: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x4000c
				; GFX789: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40014
				; GFX789: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40010
				; GFX789: s_waitcnt vmcnt(0)
				; GFX789-NEXT: v_mad_u32_u24 [[MAD1:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
				; GFX789-NEXT: v_mad_u32_u24 [[MAD2:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, [[MAD1]]
				; GFX789: v_mad_u32_u24 [[MAD3:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, [[MAD2]]
				; GFX789: v_mad_u32_u24 [[MAD4:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, [[MAD3]]
				; GFX789: v_mad_u32_u24 [[MAD5:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, [[MAD4]]
				; GFX789: v_mad_u32_u24 [[MAD6:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, [[MAD5]]
				; GFX789: v_mad_u32_u24 [[MAD7:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, [[MAD6]]
				; GFX789: {{buffer|flat|global}}_store_byte
				; GFX789-NEXT: s_endpgm

				; GCN-DL: v_mov_b32_e32 v0, s0
				; GCN-DL-NEXT: v_mov_b32_e32 v1, s1
				; GCN-DL-NEXT: global_load_ubyte v2, v[0:1], off
				; GCN-DL-NEXT: s_waitcnt lgkmcnt(0)
				; GCN-DL-NEXT: s_and_b32 s0, s2, 15
				; GCN-DL-NEXT: s_and_b32 s1, s4, 15
				; GCN-DL-NEXT: v_mov_b32_e32 v3, s1
				; GCN-DL-NEXT: s_bfe_u32 s5, s4, 0x40004
				; GCN-DL-NEXT: s_bfe_u32 s6, s4, 0x40008
				; GCN-DL-NEXT: v_mov_b32_e32 v4, s6
				; GCN-DL-NEXT: s_bfe_u32 s7, s2, 0x40008
				; GCN-DL-NEXT: v_mov_b32_e32 v5, s5
				; GCN-DL-NEXT: s_bfe_u32 s1, s2, 0x40004
				; GCN-DL-NEXT: v_mul_u32_u24_e32 v4, s7, v4
				; GCN-DL-NEXT: s_bfe_u32 s5, s4, 0x4000c
				; GCN-DL-NEXT: v_and_b32_e32 v4, 15, v4
				; GCN-DL-NEXT: s_bfe_u32 s7, s4, 0x40010
				; GCN-DL-NEXT: v_mov_b32_e32 v6, s5
				; GCN-DL-NEXT: s_bfe_u32 s6, s2, 0x4000c
				; GCN-DL-NEXT: s_bfe_u32 s8, s4, 0x40014
				; GCN-DL-NEXT: v_mov_b32_e32 v7, s7
				; GCN-DL-NEXT: s_bfe_u32 s5, s2, 0x40010
				; GCN-DL-NEXT: s_bfe_u32 s9, s4, 0x40018
				; GCN-DL-NEXT: v_mov_b32_e32 v8, s8
				; GCN-DL-NEXT: s_bfe_u32 s7, s2, 0x40014
				; GCN-DL-NEXT: s_bfe_u32 s8, s2, 0x40018
				; GCN-DL-NEXT: s_lshr_b32 s4, s4, 28
				; GCN-DL-NEXT: v_mov_b32_e32 v9, s9
				; GCN-DL-NEXT: s_lshr_b32 s2, s2, 28
				; GCN-DL-NEXT: s_waitcnt vmcnt(0)
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s0, v3, v2
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s1, v5, v2
				; GCN-DL-NEXT: v_and_b32_e32 v2, 15, v2
				; GCN-DL-NEXT: v_add_u32_e32 v2, v2, v4
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s6, v6, v2
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s5, v7, v2
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s7, v8, v2
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s8, v9, v2
				; GCN-DL-NEXT: v_mov_b32_e32 v3, s4
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s2, v3, v2
				; GCN-DL-NEXT: v_and_b32_e32 v2, 15, v2
				; GCN-DL-NEXT: global_store_byte v[0:1], v2, off
				; GCN-DL-NEXT: s_endpgm
				<8 x i4> addrspace(1)* %src2,
				i4 addrspace(1)* nocapture %dst) {
				entry:
				%vec1 = load <8 x i4>, <8 x i4> addrspace(1)* %src1
				%vec2 = load <8 x i4>, <8 x i4> addrspace(1)* %src2

				%v1e0 = extractelement <8 x i4> %vec1, i64 0
				%v2e0 = extractelement <8 x i4> %vec2, i64 0
				%mul0 = mul nuw nsw i4 %v1e0, %v2e0

				%v1e1 = extractelement <8 x i4> %vec1, i64 1
				%v2e1 = extractelement <8 x i4> %vec2, i64 1
				%mul1 = mul nuw nsw i4 %v1e1, %v2e1

				%v1e2 = extractelement <8 x i4> %vec1, i64 2
				%v2e2 = extractelement <8 x i4> %vec2, i64 2
				%mul2 = mul nuw nsw i4 %v1e2, %v2e2

				%v1e3 = extractelement <8 x i4> %vec1, i64 3
				%v2e3 = extractelement <8 x i4> %vec2, i64 3
				%mul3 = mul nuw nsw i4 %v1e3, %v2e3

				%v1e4 = extractelement <8 x i4> %vec1, i64 4
				%v2e4 = extractelement <8 x i4> %vec2, i64 4
				%mul4 = mul nuw nsw i4 %v1e4, %v2e4

				%v1e5 = extractelement <8 x i4> %vec1, i64 5
				%v2e5 = extractelement <8 x i4> %vec2, i64 5
				%mul5 = mul nuw nsw i4 %v1e5, %v2e5

				%v1e6 = extractelement <8 x i4> %vec1, i64 6
				%v2e6 = extractelement <8 x i4> %vec2, i64 6
				%mul6 = mul nuw nsw i4 %v1e6, %v2e6

				%v1e7 = extractelement <8 x i4> %vec1, i64 7
				%v2e7 = extractelement <8 x i4> %vec2, i64 7
				%mul7 = mul nuw nsw i4 %v1e7, %v2e7

				%acc = load i4, i4 addrspace(1)* %dst, align 4
				%add1 = add i4 %mul0, %acc
				%add2 = add i4 %add1, %mul1
				%add3 = add i4 %add2, %mul2
				%add4 = add i4 %add3, %mul3
				%add5 = add i4 %add4, %mul4
				%add6 = add i4 %add5, %mul5
				%add7 = add i4 %add6, %mul6
				%add8 = add i4 %add7, %mul7

				store i4 %add8, i4 addrspace(1)* %dst, align 4
				ret void
				}

				; TODO: Currently, permutation of udot8 is turned off due to a huge increase
				; in the compile time.
				define amdgpu_kernel void @udot8_CommutationInsideMAD(<8 x i4> addrspace(1)* %src1,
				; GCN-LABEL: udot8_CommutationInsideMAD:
				; GCN: ; %bb.0: ; %entry
				; GCN: s_load_dwordx4 s{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}
				; GCN-NEXT: s_load_dwordx2 s{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}

				; GFX789: s_waitcnt lgkmcnt(0)
				; GFX789-NEXT: s_load_dword
				; GFX789: s_load_dword
				; GFX789: s_waitcnt lgkmcnt(0)
				; GFX789: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40010
				; GFX789: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x4000c
				; GFX789: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40014
				; GFX789: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40010
				; GFX789: s_waitcnt vmcnt(0)
				; GFX789-NEXT: v_mad_u32_u24 [[MAD1:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
				; GFX789-NEXT: v_mad_u32_u24 [[MAD2:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, [[MAD1]]
				; GFX789: v_mad_u32_u24 [[MAD3:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, [[MAD2]]
				; GFX789: v_mad_u32_u24 [[MAD4:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, [[MAD3]]
				; GFX789: v_mad_u32_u24 [[MAD5:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, [[MAD4]]
				; GFX789: v_mad_u32_u24 [[MAD6:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, [[MAD5]]
				; GFX789: v_mad_u32_u24 [[MAD7:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, [[MAD6]]
				; GFX789: {{buffer\|flat\|global}}_store_byte
				; GFX789-NEXT: s_endpgm

				; GCN-DL: s_waitcnt lgkmcnt(0)
				; GCN-DL-NEXT: s_load_dword s2, s[4:5], 0x0
				; GCN-DL-NEXT: s_load_dword s4, s[6:7], 0x0
				; GCN-DL-NEXT: v_mov_b32_e32 v0, s0
				; GCN-DL-NEXT: v_mov_b32_e32 v1, s1
				; GCN-DL-NEXT: global_load_ubyte v2, v[0:1], off
				; GCN-DL-NEXT: s_waitcnt lgkmcnt(0)
				; GCN-DL-NEXT: s_and_b32 s0, s2, 15
				; GCN-DL-NEXT: s_and_b32 s1, s4, 15
				; GCN-DL-NEXT: v_mov_b32_e32 v3, s1
				; GCN-DL-NEXT: s_bfe_u32 s5, s4, 0x40004
				; GCN-DL-NEXT: s_bfe_u32 s6, s4, 0x40008
				; GCN-DL-NEXT: v_mov_b32_e32 v4, s5
				; GCN-DL-NEXT: s_bfe_u32 s1, s2, 0x40004
				; GCN-DL-NEXT: s_bfe_u32 s7, s4, 0x4000c
				; GCN-DL-NEXT: v_mov_b32_e32 v5, s6
				; GCN-DL-NEXT: s_bfe_u32 s5, s2, 0x40008
				; GCN-DL-NEXT: s_bfe_u32 s8, s4, 0x40010
				; GCN-DL-NEXT: v_mov_b32_e32 v6, s7
				; GCN-DL-NEXT: s_bfe_u32 s6, s2, 0x4000c
				; GCN-DL-NEXT: s_bfe_u32 s9, s4, 0x40014
				; GCN-DL-NEXT: v_mov_b32_e32 v7, s8
				; GCN-DL-NEXT: s_bfe_u32 s7, s2, 0x40010
				; GCN-DL-NEXT: s_bfe_u32 s10, s4, 0x40018
				; GCN-DL-NEXT: v_mov_b32_e32 v8, s9
				; GCN-DL-NEXT: s_bfe_u32 s8, s2, 0x40014
				; GCN-DL-NEXT: s_bfe_u32 s9, s2, 0x40018
				; GCN-DL-NEXT: s_lshr_b32 s4, s4, 28
				; GCN-DL-NEXT: v_mov_b32_e32 v9, s10
				; GCN-DL-NEXT: s_lshr_b32 s2, s2, 28
				; GCN-DL-NEXT: s_waitcnt vmcnt(0)
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s0, v3, v2
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s1, v4, v2
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s5, v5, v2
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s6, v6, v2
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s7, v7, v2
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s8, v8, v2
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s9, v9, v2
				; GCN-DL-NEXT: v_mov_b32_e32 v3, s4
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s2, v3, v2
				; GCN-DL-NEXT: v_and_b32_e32 v2, 15, v2
				; GCN-DL-NEXT: global_store_byte v[0:1], v2, off
				; GCN-DL-NEXT: s_endpgm

				<8 x i4> addrspace(1)* %src2,
				i4 addrspace(1)* nocapture %dst) {
				entry:
				%vec1 = load <8 x i4>, <8 x i4> addrspace(1)* %src1
				%vec2 = load <8 x i4>, <8 x i4> addrspace(1)* %src2

				%v1e0 = extractelement <8 x i4> %vec1, i64 0
				%v2e0 = extractelement <8 x i4> %vec2, i64 0
				%mul0 = mul nuw nsw i4 %v1e0, %v2e0

				%v1e1 = extractelement <8 x i4> %vec1, i64 1
				%v2e1 = extractelement <8 x i4> %vec2, i64 1
				%mul1 = mul nuw nsw i4 %v1e1, %v2e1

				%v1e2 = extractelement <8 x i4> %vec1, i64 2
				%v2e2 = extractelement <8 x i4> %vec2, i64 2
				%mul2 = mul nuw nsw i4 %v1e2, %v2e2

				%v1e3 = extractelement <8 x i4> %vec1, i64 3
				%v2e3 = extractelement <8 x i4> %vec2, i64 3
				%mul3 = mul nuw nsw i4 %v1e3, %v2e3

				%v1e4 = extractelement <8 x i4> %vec1, i64 4
				%v2e4 = extractelement <8 x i4> %vec2, i64 4
				%mul4 = mul nuw nsw i4 %v1e4, %v2e4

				%v1e5 = extractelement <8 x i4> %vec1, i64 5
				%v2e5 = extractelement <8 x i4> %vec2, i64 5
				%mul5 = mul nuw nsw i4 %v1e5, %v2e5

				%v1e6 = extractelement <8 x i4> %vec1, i64 6
				%v2e6 = extractelement <8 x i4> %vec2, i64 6
				%mul6 = mul nuw nsw i4 %v1e6, %v2e6

				%v1e7 = extractelement <8 x i4> %vec1, i64 7
				%v2e7 = extractelement <8 x i4> %vec2, i64 7
				%mul7 = mul nuw nsw i4 %v1e7, %v2e7

				%acc = load i4, i4 addrspace(1)* %dst, align 4
				%add1 = add i4 %mul0, %acc
				%add2 = add i4 %mul1, %add1
				%add3 = add i4 %mul2, %add2
				%add4 = add i4 %mul3, %add3
				%add5 = add i4 %mul4, %add4
				%add6 = add i4 %mul5, %add5
				%add7 = add i4 %mul6, %add6
				%add8 = add i4 %mul7, %add7

				store i4 %add8, i4 addrspace(1)* %dst, align 4
				ret void
				}
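The `s_bfe_u32` immediates in the checks above (`0x40004`, `0x40008`, and so on) encode a bit-field offset and width. The following Python sketch decodes them under the assumption that the offset sits in bits [4:0] and the width in bits [22:16] of the second operand. The helper names are hypothetical and exist only for this illustration.

```python
def decode_bfe_imm(imm):
    """Split an s_bfe_u32 immediate into (offset, width).

    Assumed layout: offset in bits [4:0], width in bits [22:16].
    """
    offset = imm & 0x1F
    width = (imm >> 16) & 0x7F
    return offset, width


def s_bfe_u32(src, imm):
    """Model an unsigned bit-field extract from a 32-bit value."""
    offset, width = decode_bfe_imm(imm)
    src &= 0xFFFFFFFF
    return (src >> offset) & ((1 << width) - 1)


def unpack_u4(src):
    """Extract the eight 4-bit lanes the way the checked code does:
    `s_and_b32 ..., 15` for lane 0, `s_bfe_u32` for lanes 1-6,
    and `s_lshr_b32 ..., 28` for lane 7.
    """
    src &= 0xFFFFFFFF
    lanes = [src & 15]
    for imm in (0x40004, 0x40008, 0x4000C, 0x40010, 0x40014, 0x40018):
        lanes.append(s_bfe_u32(src, imm))
    lanes.append(src >> 28)
    return lanes
```

For example, `0x40018` decodes to offset 24 and width 4, which selects lane 6 of the packed `<8 x i4>` value.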

				define amdgpu_kernel void @udot8_multiuses_mul1(<8 x i4> addrspace(1)* %src1,
				; GCN-LABEL: udot8_multiuses_mul1:
				; GCN: ; %bb.0: ; %entry
				; GCN-NEXT: s_load_dwordx4 s{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}
				; GCN-NEXT: s_load_dwordx2 s{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}

				; GFX789: s_load_dword s{{[0-9]+}}, s{{\[[0-9]+:[0-9]+\]}}, 0x0
				; GFX789-NEXT: s_load_dword s{{[0-9]+}}, s{{\[[0-9]+:[0-9]+\]}}, 0x0
				; GFX789-NEXT: s_load_dword s{{[0-9]+}}, s{{\[[0-9]+:[0-9]+\]}}, 0x0
				; GFX789: s_waitcnt lgkmcnt(0)
				; GFX789-NEXT: s_lshr_b32 s{{[0-9]+}}, s{{[0-9]+}}, 28
				; GFX789: s_lshr_b32 s{{[0-9]+}}, s{{[0-9]+}}, 28
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40018
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40014
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40010
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x4000c
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40008
				; GFX789: s_and_b32 s{{[0-9]+}}, s{{[0-9]+}}, 15
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40018
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40014
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40010
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x4000c
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40008
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40004
				; GFX789-NEXT: s_and_b32 s{{[0-9]+}}, s{{[0-9]+}}, 15
				; GFX789-NEXT: v_mov_b32_e32 v{{[0-9]+}}
				; GFX789-NEXT: v_mov_b32_e32 [[SRC2:v[0-9]+]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD1:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, [[SRC2]]
				; GFX789: v_mov_b32_e32 [[V2E2:v[0-9]+]]
				; GFX789-NEXT: v_mad_u32_u24
				; GFX789: v_mov_b32_e32 [[V2E3:v[0-9]+]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD3:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}
				; GFX789-NEXT: v_mov_b32_e32 [[V2E4:v[0-9]+]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD4:v[0-9]+]], s{{[0-9]+}}, [[V2E4]], [[MAD3]]
				; GFX789-NEXT: v_mov_b32_e32 [[V2E5:v[0-9]+]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD5:v[0-9]+]], s{{[0-9]+}}, [[V2E5]], [[MAD4]]
				; GFX789-NEXT: v_mov_b32_e32 [[V2E6:v[0-9]+]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD6:v[0-9]+]], s{{[0-9]+}}, [[V2E6]], [[MAD5]]
				; GFX789-NEXT: v_mov_b32_e32 [[V2E7:v[0-9]+]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD7:v[0-9]+]], s{{[0-9]+}}, [[V2E7]], [[MAD6]]
				; GFX789-NEXT: v_mov_b32_e32 [[V2E8:v[0-9]+]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD8:v[0-9]+]]
				; GFX789: {{buffer\|flat\|global}}_store_dword
				; GFX789-NEXT: s_endpgm

				; GCN-DL: s_waitcnt lgkmcnt(0)
				; GCN-DL-NEXT: s_load_dword s2, s[4:5], 0x0
				; GCN-DL-NEXT: s_load_dword s4, s[6:7], 0x0
				; GCN-DL-NEXT: s_load_dword s5, s[0:1], 0x0
				; GCN-DL-NEXT: v_mov_b32_e32 v0, s0
				; GCN-DL-NEXT: v_mov_b32_e32 v1, s1
				; GCN-DL-NEXT: s_waitcnt lgkmcnt(0)
				; GCN-DL-NEXT: s_lshr_b32 s0, s2, 28
				; GCN-DL-NEXT: s_bfe_u32 s17, s4, 0x40004
				; GCN-DL-NEXT: s_lshr_b32 s11, s4, 28
				; GCN-DL-NEXT: s_bfe_u32 s12, s4, 0x40018
				; GCN-DL-NEXT: s_bfe_u32 s13, s4, 0x40014
				; GCN-DL-NEXT: s_bfe_u32 s14, s4, 0x40010
				; GCN-DL-NEXT: s_bfe_u32 s15, s4, 0x4000c
				; GCN-DL-NEXT: s_bfe_u32 s16, s4, 0x40008
				; GCN-DL-NEXT: s_and_b32 s4, s4, 15
				; GCN-DL-NEXT: s_bfe_u32 s1, s2, 0x40018
				; GCN-DL-NEXT: s_bfe_u32 s6, s2, 0x40014
				; GCN-DL-NEXT: s_bfe_u32 s7, s2, 0x40010
				; GCN-DL-NEXT: s_bfe_u32 s8, s2, 0x4000c
				; GCN-DL-NEXT: s_bfe_u32 s9, s2, 0x40008
				; GCN-DL-NEXT: s_bfe_u32 s10, s2, 0x40004
				; GCN-DL-NEXT: s_and_b32 s2, s2, 15
				; GCN-DL-NEXT: v_mov_b32_e32 v2, s4
				; GCN-DL-NEXT: v_mov_b32_e32 v3, s5
				; GCN-DL-NEXT: v_mad_u32_u24 v3, s2, v2, v3
				; GCN-DL-NEXT: v_mov_b32_e32 v4, s17
				; GCN-DL-NEXT: v_mad_u32_u24 v2, s2, v2, v3
				; GCN-DL-NEXT: v_mad_u32_u24 v3, s10, v4, v3
				; GCN-DL-NEXT: v_mov_b32_e32 v4, s16
				; GCN-DL-NEXT: v_mad_u32_u24 v3, s9, v4, v3
				; GCN-DL-NEXT: v_mov_b32_e32 v4, s15
				; GCN-DL-NEXT: v_mad_u32_u24 v3, s8, v4, v3
				; GCN-DL-NEXT: v_mov_b32_e32 v4, s14
				; GCN-DL-NEXT: v_mad_u32_u24 v3, s7, v4, v3
				; GCN-DL-NEXT: v_mov_b32_e32 v4, s13
				; GCN-DL-NEXT: v_mad_u32_u24 v3, s6, v4, v3
				; GCN-DL-NEXT: v_mov_b32_e32 v4, s12
				; GCN-DL-NEXT: v_mad_u32_u24 v3, s1, v4, v3
				; GCN-DL-NEXT: v_mov_b32_e32 v4, s11
				; GCN-DL-NEXT: v_mad_u32_u24 v3, s0, v4, v3
				; GCN-DL-NEXT: v_add_u32_e32 v2, v2, v3
				; GCN-DL-NEXT: global_store_dword v[0:1], v2, off
				; GCN-DL-NEXT: s_endpgm
				<8 x i4> addrspace(1)* %src2,
				i32 addrspace(1)* nocapture %dst) {
				entry:
				%vec1 = load <8 x i4>, <8 x i4> addrspace(1)* %src1
				%vec2 = load <8 x i4>, <8 x i4> addrspace(1)* %src2

				%v1e0 = extractelement <8 x i4> %vec1, i64 0
				%cv1e0 = zext i4 %v1e0 to i32
				%v2e0 = extractelement <8 x i4> %vec2, i64 0
				%cv2e0 = zext i4 %v2e0 to i32
				%mul0 = mul nuw nsw i32 %cv1e0, %cv2e0

				%v1e1 = extractelement <8 x i4> %vec1, i64 1
				%cv1e1 = zext i4 %v1e1 to i32
				%v2e1 = extractelement <8 x i4> %vec2, i64 1
				%cv2e1 = zext i4 %v2e1 to i32
				%mul1 = mul nuw nsw i32 %cv1e1, %cv2e1

				%v1e2 = extractelement <8 x i4> %vec1, i64 2
				%cv1e2 = zext i4 %v1e2 to i32
				%v2e2 = extractelement <8 x i4> %vec2, i64 2
				%cv2e2 = zext i4 %v2e2 to i32
				%mul2 = mul nuw nsw i32 %cv1e2, %cv2e2

				%v1e3 = extractelement <8 x i4> %vec1, i64 3
				%cv1e3 = zext i4 %v1e3 to i32
				%v2e3 = extractelement <8 x i4> %vec2, i64 3
				%cv2e3 = zext i4 %v2e3 to i32
				%mul3 = mul nuw nsw i32 %cv1e3, %cv2e3

				%v1e4 = extractelement <8 x i4> %vec1, i64 4
				%cv1e4 = zext i4 %v1e4 to i32
				%v2e4 = extractelement <8 x i4> %vec2, i64 4
				%cv2e4 = zext i4 %v2e4 to i32
				%mul4 = mul nuw nsw i32 %cv1e4, %cv2e4

				%v1e5 = extractelement <8 x i4> %vec1, i64 5
				%cv1e5 = zext i4 %v1e5 to i32
				%v2e5 = extractelement <8 x i4> %vec2, i64 5
				%cv2e5 = zext i4 %v2e5 to i32
				%mul5 = mul nuw nsw i32 %cv1e5, %cv2e5

				%v1e6 = extractelement <8 x i4> %vec1, i64 6
				%cv1e6 = zext i4 %v1e6 to i32
				%v2e6 = extractelement <8 x i4> %vec2, i64 6
				%cv2e6 = zext i4 %v2e6 to i32
				%mul6 = mul nuw nsw i32 %cv1e6, %cv2e6

				%v1e7 = extractelement <8 x i4> %vec1, i64 7
				%cv1e7 = zext i4 %v1e7 to i32
				%v2e7 = extractelement <8 x i4> %vec2, i64 7
				%cv2e7 = zext i4 %v2e7 to i32
				%mul7 = mul nuw nsw i32 %cv1e7, %cv2e7

				%acc = load i32, i32 addrspace(1)* %dst, align 4
				%add1 = add i32 %mul0, %acc
				%add = add i32 %mul0, %add1
				%add2 = add i32 %add1, %mul1
				%add3 = add i32 %add2, %mul2
				%add4 = add i32 %add3, %mul3
				%add5 = add i32 %add4, %mul4
				%add6 = add i32 %add5, %mul5
				%add7 = add i32 %add6, %mul6
				%add8 = add i32 %add7, %mul7

				%res = add i32 %add, %add8
				store i32 %res, i32 addrspace(1)* %dst, align 4
				ret void
				}

				define amdgpu_kernel void @udot8_acc32_vecMul(<8 x i4> addrspace(1)* %src1,
				; GCN-LABEL: udot8_acc32_vecMul:
				; GCN: ; %bb.0: ; %entry
				; GCN-NEXT: s_load_dwordx4 s{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}
				; GCN-NEXT: s_load_dwordx2 s{{\[[0-9]+:[0-9]+\]}}, s{{\[[0-9]+:[0-9]+\]}}

				; GFX789: s_load_dword s{{[0-9]+}}, s{{\[[0-9]+:[0-9]+\]}}, 0x0
				; GFX789-NEXT: s_load_dword s{{[0-9]+}}, s{{\[[0-9]+:[0-9]+\]}}, 0x0
				; GFX789-NEXT: s_load_dword s{{[0-9]+}}, s{{\[[0-9]+:[0-9]+\]}}, 0x0
				; GFX789: s_waitcnt lgkmcnt(0)
				; GFX789-NEXT: s_lshr_b32 s{{[0-9]+}}, s{{[0-9]+}}, 28
				; GFX789-NEXT: s_lshr_b32 s{{[0-9]+}}, s{{[0-9]+}}, 28
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40018
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40014
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40010
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x4000c
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40008
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40004
				; GFX789-NEXT: s_and_b32 s{{[0-9]+}}, s{{[0-9]+}}, 15
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40018
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40014
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40010
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x4000c
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40008
				; GFX789-NEXT: s_bfe_u32 s{{[0-9]+}}, s{{[0-9]+}}, 0x40004
				; GFX789-NEXT: s_and_b32 s{{[0-9]+}}, s{{[0-9]+}}, 15
				; GFX789-NEXT: v_mov_b32_e32 v{{[0-9]+}}
				; GFX789-NEXT: v_mov_b32_e32 [[SRC2:v[0-9]+]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD1:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, [[SRC2]]
				; GFX789-NEXT: v_mov_b32_e32 [[V2E2:v[0-9]+]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD2:v[0-9]+]], s{{[0-9]+}}, [[V2E2]], [[MAD1]]
				; GFX789-NEXT: v_mov_b32_e32 [[V2E3:v[0-9]+]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD3:v[0-9]+]], s{{[0-9]+}}, [[V2E3]], [[MAD2]]
				; GFX789-NEXT: v_mov_b32_e32 [[V2E4:v[0-9]+]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD4:v[0-9]+]], s{{[0-9]+}}, [[V2E4]], [[MAD3]]
				; GFX789-NEXT: v_mov_b32_e32 [[V2E5:v[0-9]+]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD5:v[0-9]+]], s{{[0-9]+}}, [[V2E5]], [[MAD4]]
				; GFX789-NEXT: v_mov_b32_e32 [[V2E6:v[0-9]+]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD6:v[0-9]+]], s{{[0-9]+}}, [[V2E6]], [[MAD5]]
				; GFX789-NEXT: v_mov_b32_e32 [[V2E7:v[0-9]+]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD7:v[0-9]+]], s{{[0-9]+}}, [[V2E7]], [[MAD6]]
				; GFX789-NEXT: v_mov_b32_e32 [[V2E8:v[0-9]+]]
				; GFX789-NEXT: v_mad_u32_u24 [[MAD8:v[0-9]+]], s{{[0-9]+}}, [[V2E8]], [[MAD7]]
				; GFX789-NEXT: {{buffer\|flat\|global}}_store_dword
				; GFX789-NEXT: s_endpgm

				; GCN-DL: s_waitcnt lgkmcnt(0)
				; GCN-DL-NEXT: s_load_dword s2, s[4:5], 0x0
				; GCN-DL-NEXT: s_load_dword s4, s[6:7], 0x0
				; GCN-DL-NEXT: s_load_dword s5, s[0:1], 0x0
				; GCN-DL-NEXT: v_mov_b32_e32 v0, s0
				; GCN-DL-NEXT: v_mov_b32_e32 v1, s1
				; GCN-DL-NEXT: s_waitcnt lgkmcnt(0)
				; GCN-DL-NEXT: v_mov_b32_e32 v2, s4
				; GCN-DL-NEXT: v_mov_b32_e32 v3, s5
				; GCN-DL-NEXT: v_dot8_u32_u4 v2, s2, v2, v3
				; GCN-DL-NEXT: global_store_dword v[0:1], v2, off
				; GCN-DL-NEXT: s_endpgm

				<8 x i4> addrspace(1)* %src2,
				i32 addrspace(1)* nocapture %dst) {
				entry:
				%vec1 = load <8 x i4>, <8 x i4> addrspace(1)* %src1
				%vec2 = load <8 x i4>, <8 x i4> addrspace(1)* %src2

				%cvec1 = zext <8 x i4> %vec1 to <8 x i32>
				%cvec2 = zext <8 x i4> %vec2 to <8 x i32>

				%mul = mul <8 x i32> %cvec1, %cvec2
				%mul0 = extractelement <8 x i32> %mul, i64 0
				%mul1 = extractelement <8 x i32> %mul, i64 1
				%mul2 = extractelement <8 x i32> %mul, i64 2
				%mul3 = extractelement <8 x i32> %mul, i64 3
				%mul4 = extractelement <8 x i32> %mul, i64 4
				%mul5 = extractelement <8 x i32> %mul, i64 5
				%mul6 = extractelement <8 x i32> %mul, i64 6
				%mul7 = extractelement <8 x i32> %mul, i64 7

				%acc = load i32, i32 addrspace(1)* %dst, align 4
				%add1 = add i32 %mul0, %acc
				%add2 = add i32 %add1, %mul1
				%add3 = add i32 %add2, %mul2
				%add4 = add i32 %add3, %mul3
				%add5 = add i32 %add4, %mul4
				%add6 = add i32 %add5, %mul5
				%add7 = add i32 %add6, %mul6
				%add8 = add i32 %add7, %mul7

				store i32 %add8, i32 addrspace(1)* %dst, align 4
				ret void
				}
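In `udot8_acc32_vecMul` the eight multiplies and adds collapse into a single `v_dot8_u32_u4`. As a reading aid, the sketch below models the formula from the revision summary (the sum of the eight 4-bit products plus the 32-bit accumulator) and compares it with a chain of eight multiply-adds like the one the GFX789 checks expect. The function names are hypothetical, and wrap-around at 32 bits is an assumption of this model.

```python
MASK32 = 0xFFFFFFFF


def u4_lanes(value):
    """Return the eight unsigned 4-bit lanes of a 32-bit value, lane 0 first."""
    return [(value >> (4 * i)) & 0xF for i in range(8)]


def udot8_u32_u4(s0, s1, s2):
    """D.u32 = sum(S0.u4[i] * S1.u4[i] for i in 0..7) + S2.u32.

    The result is assumed to wrap modulo 2**32.
    """
    dot = sum(x * y for x, y in zip(u4_lanes(s0), u4_lanes(s1)))
    return (dot + s2) & MASK32


def mad_u32_u24(a, b, c):
    """Model v_mad_u32_u24: multiply the low 24 bits, add c, keep 32 bits."""
    return ((a & 0xFFFFFF) * (b & 0xFFFFFF) + c) & MASK32


def mad_chain(s0, s1, s2):
    """Accumulate the eight lane products with one multiply-add per lane."""
    acc = s2 & MASK32
    for x, y in zip(u4_lanes(s0), u4_lanes(s1)):
        acc = mad_u32_u24(x, y, acc)
    return acc
```

Both functions compute the same value for any inputs, which is the equivalence the udot8 pattern relies on when the accumulator is 32 bits wide.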

				; TODO: Clean up the code (by default pk_mad_I16 should be generated), then
				; support the pattern.
				define amdgpu_kernel void @udot8_acc16_vecMul(<8 x i4> addrspace(1)* %src1,
				; GCN-LABEL: udot8_acc16_vecMul:
				; GCN: ; %bb.0: ; %entry
				; GCN-NEXT: s_load_dwordx4
				; GCN-NEXT: s_load_dwordx2

				; GCN: s_load_dword
				; GCN: {{buffer\|flat\|global}}_load_ushort
				; GCN: s_waitcnt lgkmcnt(0)

				; GFX7: v_mul_u32_u24_e32 v2, s13, v2
				; GFX7-NEXT: v_mul_u32_u24_e32 v4, s11, v4
				; GFX7-NEXT: s_bfe_u32 s2, s0, 0x40014
				; GFX7-NEXT: s_bfe_u32 s8, s0, 0x40010
				; GFX7-NEXT: s_lshr_b32 s9, s0, 28
				; GFX7-NEXT: v_mov_b32_e32 v6, s16
				; GFX7-NEXT: s_bfe_u32 s10, s0, 0x40018
				; GFX7-NEXT: s_and_b32 s12, s0, 15
				; GFX7-NEXT: v_mov_b32_e32 v3, s19
				; GFX7-NEXT: s_bfe_u32 s0, s0, 0x40008
				; GFX7-NEXT: v_mov_b32_e32 v1, s1
				; GFX7-NEXT: v_mov_b32_e32 v5, s17
				; GFX7-NEXT: v_mul_u32_u24_e32 v6, s9, v6
				; GFX7-NEXT: v_mul_u32_u24_e32 v1, s0, v1
				; GFX7-NEXT: v_lshlrev_b32_e32 v2, 16, v2
				; GFX7-NEXT: v_mul_u32_u24_e32 v3, s12, v3
				; GFX7-NEXT: v_lshlrev_b32_e32 v4, 16, v4
				; GFX7-NEXT: v_or_b32_e32 v1, v1, v2
				; GFX7-NEXT: v_or_b32_e32 v2, v3, v4
				; GFX7-NEXT: v_mul_u32_u24_e32 v5, s10, v5
				; GFX7-NEXT: v_lshlrev_b32_e32 v6, 16, v6
				; GFX7-NEXT: v_mov_b32_e32 v8, s14
				; GFX7-NEXT: v_or_b32_e32 v3, v5, v6
				; GFX7-NEXT: v_alignbit_b32 v5, v1, v2, 16
				; GFX7-NEXT: v_mov_b32_e32 v7, s15
				; GFX7-NEXT: v_mul_u32_u24_e32 v8, s2, v8
				; GFX7-NEXT: v_mul_u32_u24_e32 v7, s8, v7
				; GFX7-NEXT: v_lshlrev_b32_e32 v8, 16, v8
				; GFX7-NEXT: v_lshrrev_b32_e32 v6, 16, v1
				; GFX7-NEXT: v_or_b32_e32 v4, v7, v8
				; GFX7-NEXT: v_lshrrev_b32_e32 v7, 16, v4
				; GFX7-NEXT: v_lshrrev_b32_e32 v8, 16, v3
				; GFX7-NEXT: s_waitcnt vmcnt(0)
				; GFX7-NEXT: v_add_i32_e32 v0, vcc, v0, v2
				; GFX7-NEXT: v_add_i32_e32 v0, vcc, v5, v0
				; GFX7-NEXT: v_add_i32_e32 v0, vcc, v0, v1
				; GFX7-NEXT: v_add_i32_e32 v0, vcc, v6, v0
				; GFX7-NEXT: v_add_i32_e32 v0, vcc, v4, v0
				; GFX7-NEXT: v_add_i32_e32 v0, vcc, v7, v0
				; GFX7-NEXT: v_add_i32_e32 v0, vcc, v3, v0
				; GFX7-NEXT: v_add_i32_e32 v0, vcc, v8, v0

				; GFX8: v_mad_u32_u24 v2, s0, v3, v2
				; GFX8-NEXT: v_mad_u32_u24 v2, s1, v4, v2
				; GFX8-NEXT: v_and_b32_e32 v2, 0xffff, v2
				; GFX8-NEXT: v_mad_u32_u24 v2, s6, v5, v2
				; GFX8-NEXT: v_mad_u32_u24 v2, s7, v6, v2
				; GFX8-NEXT: v_mad_u32_u24 v2, s9, v7, v2
				; GFX8-NEXT: v_mad_u32_u24 v2, s11, v8, v2
				; GFX8-NEXT: v_mad_u32_u24 v2, s13, v9, v2
				; GFX8-NEXT: v_mov_b32_e32 v3, s14
				; GFX8-NEXT: v_mad_u32_u24 v2, s2, v3, v2

				; GFX9: s_and_b32 s0, s2, 15
				; GFX9-NEXT: s_and_b32 s1, s4, 15
				; GFX9-NEXT: s_bfe_u32 s5, s4, 0x40004
				; GFX9-NEXT: s_pack_ll_b32_b16 s1, s1, s5
				; GFX9-NEXT: s_bfe_u32 s6, s2, 0x40004
				; GFX9-NEXT: s_pack_ll_b32_b16 s0, s0, s6
				; GFX9-NEXT: v_mov_b32_e32 v3, s1
				; GFX9-NEXT: s_bfe_u32 s5, s4, 0x40008
				; GFX9-NEXT: s_bfe_u32 s7, s4, 0x4000c
				; GFX9-NEXT: s_pack_ll_b32_b16 s5, s5, s7
				; GFX9-NEXT: v_pk_mul_lo_u16 v3, s0, v3
				; GFX9-NEXT: s_bfe_u32 s1, s2, 0x40008
				; GFX9-NEXT: s_bfe_u32 s6, s2, 0x4000c
				; GFX9-NEXT: s_pack_ll_b32_b16 s1, s1, s6
				; GFX9-NEXT: v_mov_b32_e32 v4, s5
				; GFX9-NEXT: s_bfe_u32 s0, s4, 0x40010
				; GFX9-NEXT: s_bfe_u32 s7, s4, 0x40014
				; GFX9-NEXT: s_pack_ll_b32_b16 s0, s0, s7
				; GFX9-NEXT: v_pk_mul_lo_u16 v4, s1, v4
				; GFX9-NEXT: s_bfe_u32 s5, s2, 0x40010
				; GFX9-NEXT: s_bfe_u32 s6, s2, 0x40014
				; GFX9-NEXT: s_bfe_u32 s1, s4, 0x40018
				; GFX9-NEXT: s_lshr_b32 s4, s4, 28
				; GFX9-NEXT: v_mov_b32_e32 v5, s0
				; GFX9-NEXT: s_pack_ll_b32_b16 s5, s5, s6
				; GFX9-NEXT: s_bfe_u32 s0, s2, 0x40018
				; GFX9-NEXT: s_lshr_b32 s2, s2, 28
				; GFX9-NEXT: s_pack_ll_b32_b16 s1, s1, s4
				; GFX9-NEXT: v_pk_mul_lo_u16 v5, s5, v5
				; GFX9-NEXT: s_pack_ll_b32_b16 s0, s0, s2
				; GFX9-NEXT: v_mov_b32_e32 v6, s1
				; GFX9-NEXT: v_pk_mul_lo_u16 v6, s0, v6
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: v_add_u32_e32 v2, v3, v2
				; GFX9-NEXT: v_add_u32_sdwa v2, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
				; GFX9-NEXT: v_add_u32_sdwa v2, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:BYTE_0
				; GFX9-NEXT: v_add_u32_sdwa v2, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
				; GFX9-NEXT: v_add_u32_e32 v2, v2, v5
				; GFX9-NEXT: v_add_u32_sdwa v2, v2, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
				; GFX9-NEXT: v_add_u32_e32 v2, v2, v6
				; GFX9-NEXT: v_add_u32_sdwa v2, v2, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1

				; GCN: {{flat\|buffer\|global}}_store_short
				; GCN-NEXT: s_endpgm
				<8 x i4> addrspace(1)* %src2,
				i16 addrspace(1)* nocapture %dst) {
				entry:
				%vec1 = load <8 x i4>, <8 x i4> addrspace(1)* %src1
				%vec2 = load <8 x i4>, <8 x i4> addrspace(1)* %src2

				%cvec1 = zext <8 x i4> %vec1 to <8 x i16>
				%cvec2 = zext <8 x i4> %vec2 to <8 x i16>

				%mul = mul <8 x i16> %cvec1, %cvec2
				%mul0 = extractelement <8 x i16> %mul, i64 0
				%mul1 = extractelement <8 x i16> %mul, i64 1
				%mul2 = extractelement <8 x i16> %mul, i64 2
				%mul3 = extractelement <8 x i16> %mul, i64 3
				%mul4 = extractelement <8 x i16> %mul, i64 4
				%mul5 = extractelement <8 x i16> %mul, i64 5
				%mul6 = extractelement <8 x i16> %mul, i64 6
				%mul7 = extractelement <8 x i16> %mul, i64 7

				%acc = load i16, i16 addrspace(1)* %dst, align 4
				%add1 = add i16 %mul0, %acc
				%add2 = add i16 %add1, %mul1
				%add3 = add i16 %add2, %mul2
				%add4 = add i16 %add3, %mul3
				%add5 = add i16 %add4, %mul4
				%add6 = add i16 %add5, %mul5
				%add7 = add i16 %add6, %mul6
				%add8 = add i16 %add7, %mul7

				store i16 %add8, i16 addrspace(1)* %dst, align 4
				ret void
				}

				; TODO: Clean up the code to generate MAD; the pattern should be recognized then.
				define amdgpu_kernel void @udot8_acc8_vecMul(<8 x i4> addrspace(1)* %src1,
				; GCN-LABEL: udot8_acc8_vecMul:
				; GCN: ; %bb.0: ; %entry
				; GCN-NEXT: s_load_dwordx4
				; GCN-NEXT: s_load_dwordx2
				; GCN: s_load_dword
				; GCN: {{buffer\|flat\|global}}_load_ubyte
				; GCN: s_waitcnt lgkmcnt(0)

				; GFX78: v_mul_u32_u24{{_sdwa\|_e32}}
				; GFX78: v_mul_u32_u24{{_sdwa\|_e32}}
				; GFX78: v_mul_u32_u24{{_sdwa\|_e32}}
				; GFX78: v_mul_u32_u24{{_sdwa\|_e32}}
				; GFX78: v_mul_u32_u24{{_sdwa\|_e32}}
				; GFX78: v_mul_u32_u24{{_sdwa\|_e32}}
				; GFX78: v_mul_u32_u24{{_sdwa\|_e32}}
				; GFX78: v_mul_u32_u24{{_sdwa\|_e32}}
				; GFX78: v_add_{{i\|u}}32_e32
				; GFX78: v_add_{{i\|u}}32_e32
				; GFX78: v_add_{{i32_e32\|u32_sdwa}}
				; GFX78: v_add_{{i32_e32\|u32_sdwa}}
				; GFX78: v_add_{{i\|u}}32_e32
				; GFX78: v_add_{{i\|u}}32_e32
				; GFX78: v_add_{{i32_e32\|u32_sdwa}}
				; GFX78: v_add_{{i32_e32\|u32_sdwa}}

				; GFX9: v_mul_lo_u16_e32 v3, s0, v3
				; GFX9-NEXT: v_mul_lo_u16_sdwa v4, s8, v4 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
				; GFX9-NEXT: v_mul_lo_u16_e32 v5, s9, v5
				; GFX9-NEXT: v_mul_lo_u16_sdwa v6, s10, v6 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
				; GFX9-NEXT: v_or_b32_e32 v3, v3, v4
				; GFX9-NEXT: v_or_b32_sdwa v4, v5, v6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
				; GFX9-NEXT: s_bfe_u32 s1, s4, 0x40014
				; GFX9-NEXT: s_bfe_u32 s5, s4, 0x40018
				; GFX9-NEXT: s_bfe_u32 s0, s4, 0x40010
				; GFX9-NEXT: s_lshr_b32 s4, s4, 28
				; GFX9-NEXT: v_or_b32_sdwa v3, v3, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
				; GFX9-NEXT: s_bfe_u32 s6, s2, 0x40010
				; GFX9-NEXT: v_mov_b32_e32 v4, s0
				; GFX9-NEXT: s_bfe_u32 s7, s2, 0x40014
				; GFX9-NEXT: v_mov_b32_e32 v5, s1
				; GFX9-NEXT: s_bfe_u32 s8, s2, 0x40018
				; GFX9-NEXT: v_mov_b32_e32 v6, s5
				; GFX9-NEXT: s_lshr_b32 s2, s2, 28
				; GFX9-NEXT: v_mov_b32_e32 v7, s4
				; GFX9-NEXT: v_mul_lo_u16_e32 v4, s6, v4
				; GFX9-NEXT: v_mul_lo_u16_sdwa v5, s7, v5 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
				; GFX9-NEXT: v_mul_lo_u16_e32 v6, s8, v6
				; GFX9-NEXT: v_mul_lo_u16_sdwa v7, s2, v7 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
				; GFX9-NEXT: v_or_b32_e32 v4, v4, v5
				; GFX9-NEXT: v_or_b32_sdwa v5, v6, v7 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
				; GFX9-NEXT: v_or_b32_sdwa v4, v4, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
				; GFX9-NEXT: v_lshrrev_b32_e32 v5, 8, v3
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: v_add_u32_e32 v2, v3, v2
				; GFX9-NEXT: v_add_u32_e32 v2, v2, v5
				; GFX9-NEXT: v_add_u32_sdwa v2, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:BYTE_2
				; GFX9-NEXT: v_add_u32_sdwa v2, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_3
				; GFX9-NEXT: v_add_u32_e32 v2, v2, v4
				; GFX9-NEXT: v_lshrrev_b32_e32 v3, 8, v4
				; GFX9-NEXT: v_add_u32_e32 v2, v2, v3
				; GFX9-NEXT: v_add_u32_sdwa v2, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
				; GFX9-NEXT: v_add_u32_sdwa v2, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_3

				; GCN: {{flat\|buffer\|global}}_store_byte
				; GCN-NEXT: s_endpgm
				<8 x i4> addrspace(1)* %src2,
				i8 addrspace(1)* nocapture %dst) {
				entry:
				%vec1 = load <8 x i4>, <8 x i4> addrspace(1)* %src1
				%vec2 = load <8 x i4>, <8 x i4> addrspace(1)* %src2

				%cvec1 = zext <8 x i4> %vec1 to <8 x i8>
				%cvec2 = zext <8 x i4> %vec2 to <8 x i8>

				%mul = mul <8 x i8> %cvec1, %cvec2
				%mul0 = extractelement <8 x i8> %mul, i64 0
				%mul1 = extractelement <8 x i8> %mul, i64 1
				%mul2 = extractelement <8 x i8> %mul, i64 2
				%mul3 = extractelement <8 x i8> %mul, i64 3
				%mul4 = extractelement <8 x i8> %mul, i64 4
				%mul5 = extractelement <8 x i8> %mul, i64 5
				%mul6 = extractelement <8 x i8> %mul, i64 6
				%mul7 = extractelement <8 x i8> %mul, i64 7

				%acc = load i8, i8 addrspace(1)* %dst, align 4
				%add1 = add i8 %mul0, %acc
				%add2 = add i8 %add1, %mul1
				%add3 = add i8 %add2, %mul2
				%add4 = add i8 %add3, %mul3
				%add5 = add i8 %add4, %mul4
				%add6 = add i8 %add5, %mul5
				%add7 = add i8 %add6, %mul6
				%add8 = add i8 %add7, %mul7

				store i8 %add8, i8 addrspace(1)* %dst, align 4
				ret void
				}

				; TODO: Once the additional "and+add" are removed, the pattern will be recognized.
				define amdgpu_kernel void @udot8_acc4_vecMul(<8 x i4> addrspace(1)* %src1,
				; GCN-LABEL: udot8_acc4_vecMul:
				; GCN: ; %bb.0: ; %entry
				; GCN-NEXT: s_load_dwordx4
				; GCN-NEXT: s_load_dwordx2
				; GCN: s_load_dword
				; GCN: {{buffer\|flat\|global}}_load_ubyte
				; GCN: s_waitcnt lgkmcnt(0)

				; GCN: s_waitcnt vmcnt(0)
				; GCN-NEXT: v_mad_u32_u24 [[MAD1:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
				; GCN: v_mad_u32_u24 [[MAD2:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}

				; GCN-DL: v_and_b32_e32 v2, 15, v2
				; GCN-DL-NEXT: v_add_u32_e32 v2, v2, v4

				; GCN: v_mad_u32_u24 [[MAD3:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
				; GCN: v_mad_u32_u24 [[MAD4:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
				; GCN: v_mad_u32_u24 [[MAD5:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
				; GCN: v_mad_u32_u24 [[MAD6:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
				; GCN: v_mov_b32_e32
				; GCN: v_mad_u32_u24 [[MAD7:v[0-9]+]], s{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}}
				; GCN: v_and_b32_e32
				; GCN: {{buffer\|flat\|global}}_store_byte
				; GCN: s_endpgm

				<8 x i4> addrspace(1)* %src2,
				i4 addrspace(1)* nocapture %dst) {
				entry:
				%vec1 = load <8 x i4>, <8 x i4> addrspace(1)* %src1
				%vec2 = load <8 x i4>, <8 x i4> addrspace(1)* %src2

				%mul = mul <8 x i4> %vec1, %vec2
				%mul0 = extractelement <8 x i4> %mul, i64 0
				%mul1 = extractelement <8 x i4> %mul, i64 1
				%mul2 = extractelement <8 x i4> %mul, i64 2
				%mul3 = extractelement <8 x i4> %mul, i64 3
				%mul4 = extractelement <8 x i4> %mul, i64 4
				%mul5 = extractelement <8 x i4> %mul, i64 5
				%mul6 = extractelement <8 x i4> %mul, i64 6
				%mul7 = extractelement <8 x i4> %mul, i64 7

				%acc = load i4, i4 addrspace(1)* %dst, align 4
				%add1 = add i4 %mul0, %acc
				%add2 = add i4 %add1, %mul1
				%add3 = add i4 %add2, %mul2
				%add4 = add i4 %add3, %mul3
				%add5 = add i4 %add4, %mul4
				%add6 = add i4 %add5, %mul5
				%add7 = add i4 %add6, %mul6
				%add8 = add i4 %add7, %mul7

				store i4 %add8, i4 addrspace(1)* %dst, align 4
				ret void
				}