This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU/wmma] - Disable 3-address syntax for f16
Needs Revision · Public

Authored by OutOfCache on Aug 16 2023, 2:14 AM.

Details

Reviewers
piotr
nhaehnle
arsenm
Group Reviewers
Restricted Project
Summary

Always keep wmma instructions with a 16-bit floating-point accumulator as two-address instructions.

This is a prerequisite for an upcoming optimization for wmma with 16-bit accumulator matrices. We want to pack the results of two separate wmmas into the same registers, so that one matrix sits in the lower half while the other matrix sits in the upper half of the registers.

We pack the values into the registers before using them as input to the first wmma:

v_wmma_f16_16x16x16_f16 v[0:7], v[8:15], v[16:23], v[0:7]
v_wmma_f16_16x16x16_f16 v[0:7], v[24:31], v[32:39], v[0:7] op_sel:[0,0,1]

Therefore, both instructions need to write to the same registers
and overwrite the values of the input matrices.
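At the LLVM IR level, the packed pair corresponds roughly to two calls of the existing wmma intrinsic, where the trailing i1 operand selects the destination half (a sketch; the value names are made up, and the register assignment shown above is the outcome we want from the backend):

%D.lo = call <16 x half> @llvm.amdgcn.wmma.f16.16x16x16.f16(<16 x half> %A0, <16 x half> %B0, <16 x half> %C.packed, i1 false) ; low halves
%D.hi = call <16 x half> @llvm.amdgcn.wmma.f16.16x16x16.f16(<16 x half> %A1, <16 x half> %B1, <16 x half> %D.lo, i1 true)     ; high halves (op_sel)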

We have verified the correct behavior by running nod.ai's Stable Diffusion with these data-layout changes. On average, this change reduced the VGPR count by 17.17% (in the 88 shaders the change applied to).

Diff Detail

Event Timeline

OutOfCache created this revision. Aug 16 2023, 2:14 AM
Herald added a project: Restricted Project. Aug 16 2023, 2:14 AM
OutOfCache requested review of this revision. Aug 16 2023, 2:14 AM
arsenm added inline comments. Aug 16 2023, 4:32 AM
llvm/lib/Target/AMDGPU/VOP3PInstructions.td
874–875

Shouldn’t lie about properties, disable at a different point?

OutOfCache added inline comments. Aug 16 2023, 6:40 AM
llvm/lib/Target/AMDGPU/VOP3PInstructions.td
874–875

Where would you suggest?

arsenm added inline comments. Aug 17 2023, 3:33 PM
llvm/lib/Target/AMDGPU/VOP3PInstructions.td
874–875

Does this happen in two-address instructions? I assume the current heuristic assumes a simple register and isn't accounting for a large tuple increasing pressure.

OutOfCache edited the summary of this revision. Aug 18 2023, 1:47 AM
OutOfCache added inline comments. Aug 18 2023, 1:54 AM
llvm/lib/Target/AMDGPU/VOP3PInstructions.td
874–875

Sorry, what do you mean by 'does this happen'? If you are talking about the conversion from a two-address to a three-address instruction, then yes, it can happen. I encountered issues when compiling Stable Diffusion shaders that initialize the matrices via the zeroinitializer constant outside a loop.

In the loop case, each matrix receives a dedicated zero matrix. Without the loop, the same zero matrix is reused for multiple wmmas, so we have the following scenario:

v_wmma_f16_16x16x16_f16 v[0:7],   ..., v[24:31]
v_wmma_f16_16x16x16_f16 v[32:39], ..., v[24:31]

So we use v[24:31] as the zero-matrix input, but a different destination matrix (e.g., v[0:7]). The problem arises once we try to pack:

v_wmma_f16_16x16x16_f16 v[0:7],     ..., v[0:7] op_sel:[0,0,1]

We have no guarantee that the upper halves of v[0:7] are zero-initialized. In practice, this indeed caused issues (completely black images as Stable Diffusion output). This change fixed it.
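For reference, the triggering IR looks roughly like this (a simplified sketch; the value names are illustrative):

%D0 = call <16 x half> @llvm.amdgcn.wmma.f16.16x16x16.f16(<16 x half> %A0, <16 x half> %B0, <16 x half> zeroinitializer, i1 false)
%D1 = call <16 x half> @llvm.amdgcn.wmma.f16.16x16x16.f16(<16 x half> %A1, <16 x half> %B1, <16 x half> zeroinitializer, i1 false)
; The zero matrix is materialized once (v[24:31] above) and shared by both calls,
; so the destination registers are not tied to the input accumulator registers.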

You are right that disabling the ability to convert is not the best solution, and I would be happy to move this change elsewhere. Currently I don't know of a better place, though. Could we maybe adapt the heuristics to always keep the two-address form for constant input matrices like the zeroinitializer? Or does that not make sense either?

Updated test

piotr added a comment. Aug 25 2023, 6:46 AM

I am trying to understand the failing case better. Can the issue only happen with the extra packing patch? Is the issue only with zeroinitializers (constant matrices), or is that just where the problem was found?

It would be preferable to keep the 3-address syntax, as it can produce better code (see the extra moves in the updated IR tests).

Sorry for responding so late.

The problem was only found with zeroinitializer matrices that were reused as input for multiple wmma_f16 instructions. However, it could happen for other, non-constant matrices as well, as long as the input and output accumulator registers are different.

After further discussion with other compiler engineers, we want to add a new pseudo instruction for the tied instruction. Then we can update the intrinsics in the packing patch to lower to these specific pseudos. That way, the original wmma_f16 instruction can still be a three-address instruction in cases outside the patch.

  • [AMDGPU/VOP3P] - Simplify wmma instruction defs
  • [AMDGPU/VOP3P] - Add tied wmma_f16 pseudos

Use new pseudo instructions instead of modifying existing ones.

piotr added a comment. Sep 7 2023, 3:24 AM

Thanks, that would avoid the regression. However, I still do not fully understand the failing mode - can you show the test case + extra code that triggers the issue?

foad added a reviewer: Restricted Project. Sep 7 2023, 3:34 AM

The patch as-is needs some codegen tests for the new intrinsics. (I have no opinion on whether adding new intrinsics is actually the right approach here.)

OutOfCache updated this revision to Diff 556247. Sep 8 2023, 5:15 AM

Added codegen tests for new intrinsics

> Thanks, that would avoid the regression. However, I still do not fully understand the failing mode - can you show the test case + extra code that triggers the issue?

I admit, it is probably hard to follow without the full context. I'll try again.

During cooperative matrix calculation, we typically have a loop that calculates one or more accumulator matrices.
Each iteration computes a partial result of the accumulator while iterating over different factor matrices.

However, some shaders do not use a loop.

Suppose we have multiple wmma instructions that all use the same C matrix. In our case, they all use a zero matrix as input at first, because there is no previous result.

v_mov_b32 v24, 0
; ...
; v[24:31] has all zeros
v_wmma_f16_16x16x16_f16 v[0:7],   ..., v[24:31]
v_wmma_f16_16x16x16_f16 v[32:39], ..., v[24:31]

Since these are wmma_f16 instructions, they write the result into the lower 16 bits of the result registers.
Therefore, after the instructions, the content of v0 is 0x????IJKL, where IJKL is the wmma result and ???? is the previous content of v0's upper half from before the wmma.
In other words, even though the input accumulator has all zeros in its registers, these zeros are not copied into the result registers.

Then we have other wmmas, which update the result matrices. We take the previous result as the input accumulator and swap out the factor matrices for different ones.

v_wmma_f16_16x16x16_f16 v[0:7],   ..., v[0:7]
v_wmma_f16_16x16x16_f16 v[32:39], ..., v[32:39]

After that, the content of v0 is 0x????PQRS.

Usually, this is not an issue. But the future patch tries to fill the ???? with the values of another matrix, so we can save VGPRs.
We can do that thanks to the op_sel argument: we want the matrix values from v[32:39] to live in the upper halves of v[0:7].
Essentially, we want this:

v_wmma_f16_16x16x16_f16 v[0:7],   ..., v[0:7]
v_wmma_f16_16x16x16_f16 v[0:7],   ..., v[0:7] op_sel:[0,0,1] ; instead of v[32:39]

The second instruction reads from the upper 16 bits of v0 and also writes its result into the upper 16 bits of v0.
That normally does not cause an issue. However, when we initially calculated v0 with the first wmma, we had the content 0x????IJKL.
Now any wmma with op_sel set reads the upper 16 bits, the ????, as its input accumulator - and these upper bits were never initialized to zero.

That is the root of the issue: the previous content of the registers is added to the wmma result, rather than the expected 0.

Tying the input accumulator to the result accumulator solves this issue. We no longer have a separate zero matrix that we reuse across multiple wmmas.
Instead, we correctly initialize the output matrix to all zeros and then calculate the results.

v_mov_b32 v0, 0
; ...
v_wmma_f16_16x16x16_f16 v[0:7],   ..., v[0:7]

After that, the content of v0 is now 0x0000IJKL, as expected.
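Putting it together, the packed sequence the future patch aims for looks roughly like this (a sketch; the elided factor matrices differ between the two instructions):

v_mov_b32 v0, 0
; ... v[0:7] all zeros, in both register halves
v_wmma_f16_16x16x16_f16 v[0:7], ..., v[0:7]                 ; fills the low halves; high halves stay zero
v_wmma_f16_16x16x16_f16 v[0:7], ..., v[0:7] op_sel:[0,0,1]  ; reads and fills the high halves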

Does that clear things up? Let me know if I skipped some detail.

arsenm requested changes to this revision. Sep 8 2023, 7:33 AM

This shouldn't call for introducing new intrinsics

This revision now requires changes to proceed. Sep 8 2023, 7:33 AM
arsenm added a comment. Sep 8 2023, 7:39 AM

Is the problem that the rematerialization doesn't understand the liveness with op_sel?

I'd be happy to change the approach, but I can't think of a better way to preserve the old behavior while also guaranteeing the correct initialization of register values.
I assume the current behavior of wmma (only writing to one half of the register while leaving the other half untouched) is correct - or should it copy the content of the input accumulator into the other half?

I'll try to explain what the new tied intrinsic does and where it is useful:

Our frontend sees matrix multiplications of multiple, different 16-bit matrices. Each of these matrices takes 8 VGPRs. However, a 16-bit matrix only uses either the high or the low half of each of the 8 VGPRs.
So what our compiler tries to do is merge a pair of independent 16-bit matrices into 8 VGPRs: one matrix takes the low half of each register, the other takes the high half.

In IR, we have intrinsics that work on such a combined matrix-pair:

%combined = call <8 x float> @combine_halfs(<8 x float> %lo_16_bit_matrix, <8 x float> %hi_16_bit_matrix)  ; type is like <16 x half> with every second half in use
%low_half_multiplied = call <8 x float> @wmma(%a, %b, <8 x float> %combined, i1 false /* low half */)
%high_half_multiplied = call <8 x float> @wmma(%a, %b, <8 x float> %low_half_multiplied, i1 true /* high half */)

So far, so good - we halve the VGPR usage. Now we need to lower our @wmma intrinsic to llvm.amdgcn intrinsics.
For this, we need a wmma intrinsic that uses the low or high half of our value as the accumulator (C matrix) and preserves the other half of the value.
If we did not preserve the other half, we would lose the second part of the packed matrix.

This is where the new tied intrinsic comes into play: it guarantees that the "untouched" half of the value is preserved, by tying input and output to the same physical VGPRs.
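As a sketch of what this looks like at the llvm.amdgcn level (the .tied name follows this patch series; treat the exact declaration as illustrative):

declare <16 x half> @llvm.amdgcn.wmma.f16.16x16x16.f16.tied(<16 x half>, <16 x half>, <16 x half>, i1 immarg)

; The call only updates the half selected by the i1 flag; the other half of the
; accumulator survives because source and destination are tied to the same VGPRs.
%lo = call <16 x half> @llvm.amdgcn.wmma.f16.16x16x16.f16.tied(<16 x half> %a0, <16 x half> %b0, <16 x half> %acc, i1 false)
%hi = call <16 x half> @llvm.amdgcn.wmma.f16.16x16x16.f16.tied(<16 x half> %a1, <16 x half> %b1, <16 x half> %lo, i1 true)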

Thanks for the extra info; I understand the problem now: currently there seems to be no way to take advantage of the opsel bit to reuse the same destination matrix registers for two wmma instructions.

One way to fix that would be as proposed here, at the intrinsic level. In this approach the intrinsic would always take the full register (C, D matrices), operate on the specified half, and preserve the other half.

But maybe this can be fixed in codegen without adding the new intrinsic? I think the underlying problem is that we are not correctly modelling the fact that these op_sel instructions do not touch the other half. If we could do that, I would expect the two-address pass to reuse the destination registers of the first instruction, because the low halves would still be live at the end of the second instruction.

%combined = call <8 x float> @combine_halfs(<8 x float> %lo_16_bit_matrix, <8 x float> %hi_16_bit_matrix)
%low_half_multiplied = call <8 x float> @wmma(%a, %b, <8 x float> %combined, i1 false /* low half */)
%high_half_multiplied = call <8 x float> @wmma(%a, %b, <8 x float> %combined, i1 true /* high half */)

That's true @piotr, that is the underlying issue. The two-address pass needs to somehow notice this behavior. It makes sense to tackle this problem at the root, so I will look into the current decision process in the pass and think about solutions.

I am open to any suggestions. Has there been a similar case before?

> Thanks for the extra info; I understand the problem now: currently there seems to be no way to take advantage of the opsel bit to reuse the same destination matrix registers for two wmma instructions.
>
> One way to fix that would be as proposed here, at the intrinsic level. In this approach the intrinsic would always take the full register (C, D matrices), operate on the specified half, and preserve the other half.
>
> But maybe this can be fixed in codegen without adding the new intrinsic? I think the underlying problem is that we are not correctly modelling the fact that these op_sel instructions do not touch the other half. If we could do that, I would expect the two-address pass to reuse the destination registers of the first instruction, because the low halves would still be live at the end of the second instruction.

So, if you have multiple packed operations in a row, you would need to "re-combine" the matrices? I.e.:

%combined = call <8 x float> @combine_halfs(<8 x float> %lo_16_bit_matrix, <8 x float> %hi_16_bit_matrix)
%low_half_multiplied = call <8 x float> @wmma(%a, %b, <8 x float> %combined, i1 false /* low half */)
%high_half_multiplied = call <8 x float> @wmma(%a, %b, <8 x float> %combined, i1 true /* high half */)
%combined2 = call <8 x float> @combine_halfs(<8 x float> %low_half_multiplied, <8 x float> %high_half_multiplied)
%low_half_multiplied2 = call <8 x float> @wmma(%a, %b, <8 x float> %combined2, i1 false /* low half */)
%high_half_multiplied2 = call <8 x float> @wmma(%a, %b, <8 x float> %combined2, i1 true /* high half */)

Then, as you say, our register allocation needs to be intelligent enough to keep the matrices packed.
How would you define the instructions for this to work?

piotr added a comment. Sep 18 2023, 7:53 AM

> Then, as you say, our register allocation needs to be intelligent enough to keep the matrices packed.
> How would you define the instructions for this to work?

Unfortunately, looking at it a bit more, I don't think the scheme I proposed is feasible. Even if we add some extra copies to preserve the other half, the TwoAddressInstruction pass will not be able to understand that.

The only alternative I could suggest, instead of adding new intrinsics, would be to implement the packing entirely in codegen (e.g., after the TwoAddressInstruction pass).

> > Then, as you say, our register allocation needs to be intelligent enough to keep the matrices packed.
> > How would you define the instructions for this to work?
>
> Unfortunately, looking at it a bit more, I don't think the scheme I proposed is feasible. Even if we add some extra copies to preserve the other half, the TwoAddressInstruction pass will not be able to understand that.
>
> The only alternative I could suggest, instead of adding new intrinsics, would be to implement the packing entirely in codegen (e.g., after the TwoAddressInstruction pass).

Thank you for looking into it! I don't know if the packing is feasible at that stage, though. The packing affects the users of the matrices as well, and we base our solution heavily on finding the users of the lgc intrinsics. Plus, the changes can remove some intrinsic calls entirely, which reduces code size for every later pass. Delaying the changes to such a late stage would make the packing process more complicated, and we would carry code through each pass that might be removed later anyway.