This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU/wmma] - Disable 3-address syntax for f16
Needs Revision · Public

Authored by OutOfCache on Aug 16 2023, 2:14 AM.

Details

Reviewers
piotr
nhaehnle
arsenm
Group Reviewers
Restricted Project
Summary

Always keep wmma instructions with a 16-bit floating-point accumulator as two-address instructions.

This is a prerequisite for an upcoming optimization for wmma with 16-bit accumulator matrices. We want to pack the results of two separate wmmas into the same registers, so that one matrix sits in the lower half while the other matrix sits in the upper half of the registers.

We pack the values into the registers before using them as input to the first wmma:

v_wmma_f16_16x16x16_f16 v[0:7], v[8:15], v[16:23], v[0:7]
v_wmma_f16_16x16x16_f16 v[0:7], v[24:31], v[32:39], v[0:7] op_sel:[0,0,1]

Therefore, both instructions need to write to the same registers
and overwrite the values of the input matrices.
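At the LLVM IR level, the packed pair corresponds roughly to two calls of the existing wmma intrinsic, where the trailing i1 operand selects the destination half (a sketch; the value names are made up, and the register assignment shown above is the outcome we want from the backend):

%D.lo = call <16 x half> @llvm.amdgcn.wmma.f16.16x16x16.f16(<16 x half> %A0, <16 x half> %B0, <16 x half> %C.packed, i1 false) ; low halves
%D.hi = call <16 x half> @llvm.amdgcn.wmma.f16.16x16x16.f16(<16 x half> %A1, <16 x half> %B1, <16 x half> %D.lo, i1 true)     ; high halves (op_sel)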

We have verified the correct behavior by running nod.ai's Stable Diffusion with these data-layout changes. On average, this change reduced the VGPR count by 17.17% (in the 88 shaders the change applied to).

Diff Detail

Event Timeline

OutOfCache created this revision. Aug 16 2023, 2:14 AM
Herald added a project: Restricted Project. Aug 16 2023, 2:14 AM
OutOfCache requested review of this revision. Aug 16 2023, 2:14 AM
arsenm added inline comments. Aug 16 2023, 4:32 AM
llvm/lib/Target/AMDGPU/VOP3PInstructions.td
874–875

Shouldn’t lie about properties, disable at a different point?

OutOfCache added inline comments. Aug 16 2023, 6:40 AM
llvm/lib/Target/AMDGPU/VOP3PInstructions.td
874–875

Where would you suggest?

arsenm added inline comments. Aug 17 2023, 3:33 PM
llvm/lib/Target/AMDGPU/VOP3PInstructions.td
874–875

Does this happen in two-address instructions? I assume the current heuristic assumes a simple register and isn't accounting for a large tuple increasing pressure.

OutOfCache edited the summary of this revision. Aug 18 2023, 1:47 AM
OutOfCache added inline comments. Aug 18 2023, 1:54 AM
llvm/lib/Target/AMDGPU/VOP3PInstructions.td
874–875

Sorry, what do you mean by 'does this happen'? If you are talking about the conversion from a two-address to a three-address instruction, then yes, it can happen. I encountered issues when compiling Stable Diffusion shaders that initialize the matrices via the zeroinitializer constant outside a loop.

In the loop case, each matrix receives a dedicated zero matrix. Without the loop, the same zero matrix is reused for multiple wmmas, so we have the following scenario:

v_wmma_f16_16x16x16_f16 v[0:7],   ..., v[24:31]
v_wmma_f16_16x16x16_f16 v[32:39], ..., v[24:31]

So we use v[24:31] as the zero-matrix input, but a different destination matrix (e.g., v[0:7]). The problem arises once we try to pack:

v_wmma_f16_16x16x16_f16 v[0:7],     ..., v[0:7] op_sel:[0,0,1]

We have no guarantee that the upper halves of v[0:7] are zero-initialized. In practice, this indeed caused issues (completely black images as Stable Diffusion output). This change fixed it.
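For reference, the triggering IR looks roughly like this (a simplified sketch; the value names are illustrative):

%D0 = call <16 x half> @llvm.amdgcn.wmma.f16.16x16x16.f16(<16 x half> %A0, <16 x half> %B0, <16 x half> zeroinitializer, i1 false)
%D1 = call <16 x half> @llvm.amdgcn.wmma.f16.16x16x16.f16(<16 x half> %A1, <16 x half> %B1, <16 x half> zeroinitializer, i1 false)
; The zero matrix is materialized once (v[24:31] above) and shared by both calls,
; so the destination registers are not tied to the input accumulator registers.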

You are right that disabling the ability to convert is not the best solution, and I would be happy to move this change elsewhere. Currently I don't know of a better place, though. Could we maybe adapt the heuristics to always keep the two-address form for constant input matrices like the zeroinitializer? Or does that not make sense either?

Updated test

piotr added a comment. Aug 25 2023, 6:46 AM

I am trying to understand the failing case better. Can the issue only happen with the extra packing patch? Is the issue only with zeroinitializers (constant matrices), or is that just where the problem was found?

It would be preferable to keep the 3-address syntax, as it can produce better code (see the extra moves in the updated IR tests).

Sorry for responding so late.

The problem was only found with zeroinitializer matrices that were reused as input for multiple wmma_f16 instructions. However, it could happen for other, non-constant matrices as well, as long as the input and output accumulator registers are different.

After further discussion with other compiler engineers, we want to add a new pseudo instruction for the tied instruction. Then we can update the intrinsics in the packing patch to lower to these specific pseudos. That way, the original wmma_f16 instruction can still be a three-address instruction in cases outside the patch.

  • [AMDGPU/VOP3P] - Simplify wmma instruction defs
  • [AMDGPU/VOP3P] - Add tied wmma_f16 pseudos

Use new pseudo instructions instead of modifying existing ones.

piotr added a comment. Sep 7 2023, 3:24 AM

Thanks, that would avoid the regression. However, I still do not fully understand the failing mode - can you show the test case + extra code that triggers the issue?

foad added a reviewer: Restricted Project. Sep 7 2023, 3:34 AM

The patch as-is needs some codegen tests for the new intrinsics. (I have no opinion on whether adding new intrinsics is actually the right approach here.)

OutOfCache updated this revision to Diff 556247. Sep 8 2023, 5:15 AM

Added codegen tests for new intrinsics

> Thanks, that would avoid the regression. However, I still do not fully understand the failing mode - can you show the test case + extra code that triggers the issue?

I admit, it is probably hard to follow without the full context. I'll try again.

During cooperative matrix calculation, we typically have a loop that calculates one or more accumulator matrices.
Each iteration computes a partial result of the accumulator while iterating over different factor matrices.

However, some shaders do not use a loop.

Suppose we have multiple wmma instructions that all use the same C matrix. In our case, they all use a zero matrix as input at first, because there is no previous result.

v_mov_b32 v24, 0
; ...
; v[24:31] has all zeros
v_wmma_f16_16x16x16_f16 v[0:7],   ..., v[24:31]
v_wmma_f16_16x16x16_f16 v[32:39], ..., v[24:31]

Since these are wmma_f16 instructions, they write the result into the lower 16 bits of the result registers.
Therefore, after the instructions, the content of v0 is 0x????IJKL, where IJKL is the wmma result and ???? is the previous content of v0's upper half from before the wmma.
In other words, even though the input accumulator has all zeros in its registers, these zeros are not copied into the result registers.

Then we have other wmmas, which update the result matrices. We take the previous result as the input accumulator and swap out the factor matrices for different ones.

v_wmma_f16_16x16x16_f16 v[0:7],   ..., v[0:7]
v_wmma_f16_16x16x16_f16 v[32:39], ..., v[32:39]

After that, the content of v0 is 0x????PQRS.

Usually, this is not an issue. But the future patch tries to fill the ???? with the values of another matrix, so we can save VGPRs.
We can do that thanks to the op_sel argument: we want the matrix values from v[32:39] to live in the upper halves of v[0:7].
Essentially, we want this:

v_wmma_f16_16x16x16_f16 v[0:7],   ..., v[0:7]
v_wmma_f16_16x16x16_f16 v[0:7],   ..., v[0:7] op_sel:[0,0,1] ; instead of v[32:39]

The second instruction reads from the upper 16 bits of v0 and also writes its result into the upper 16 bits of v0.
That normally does not cause an issue. However, when we initially calculated v0 with the first wmma, we had the content 0x????IJKL.
Now any wmma with op_sel set reads the upper 16 bits, the ????, as its input accumulator - and these upper bits were never initialized to zero.

That is the root of the issue: the previous content of the registers is added to the wmma result, rather than the expected 0.

Tying the input accumulator to the result accumulator solves this issue. We no longer have a separate zero matrix that we reuse across multiple wmmas.
Instead, we correctly initialize the output matrix to all zeros and then calculate the results.

v_mov_b32 v0, 0
; ...
v_wmma_f16_16x16x16_f16 v[0:7],   ..., v[0:7]

After that, the content of v0 is now 0x0000IJKL, as expected.
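Putting it together, the packed sequence the future patch aims for looks roughly like this (a sketch; the elided factor matrices differ between the two instructions):

v_mov_b32 v0, 0
; ... v[0:7] all zeros, in both register halves
v_wmma_f16_16x16x16_f16 v[0:7], ..., v[0:7]                 ; fills the low halves; high halves stay zero
v_wmma_f16_16x16x16_f16 v[0:7], ..., v[0:7] op_sel:[0,0,1]  ; reads and fills the high halves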

Does that clear things up? Let me know if I skipped some detail.

arsenm requested changes to this revision. Sep 8 2023, 7:33 AM

This shouldn't call for introducing new intrinsics

This revision now requires changes to proceed. Sep 8 2023, 7:33 AM
arsenm added a comment. Sep 8 2023, 7:39 AM

Is the problem that the rematerialization doesn't understand the liveness with op_sel?

I'd be happy to change the approach, but I can't think of a better way to preserve the old behavior while also guaranteeing the correct initialization of register values.
I assume the current behavior of wmma (only writing to one half of the register while leaving the other half untouched) is correct - or should it copy the content of the input accumulator into the other half?

I'll try to explain what the new tied intrinsic does and where it is useful:

Our frontend sees matrix multiplications of multiple, different 16-bit matrices. Each of these matrices takes 8 VGPRs. However, a 16-bit matrix only uses either the high or the low half of each of the 8 VGPRs.
So what our compiler tries to do is merge a pair of independent 16-bit matrices into 8 VGPRs: one matrix takes the low half of each register, the other takes the high half.

In IR, we have intrinsics that work on such a combined matrix-pair:

%combined = call <8 x float> @combine_halfs(<8 x float> %lo_16_bit_matrix, <8 x float> %hi_16_bit_matrix)  ; type is like <16 x half> with every second half in use
%low_half_multiplied = call <8 x float> @wmma(%a, %b, <8 x float> %combined, i1 false /* low half */)
%high_half_multiplied = call <8 x float> @wmma(%a, %b, <8 x float> %low_half_multiplied, i1 true /* high half */)

So far, so good - we halve the VGPR usage. Now we need to lower our @wmma intrinsic to llvm.amdgcn intrinsics.
For this, we need a wmma intrinsic that uses the low or high half of our value as the accumulator (C matrix) and preserves the other half of the value.
If we did not preserve the other half, we would lose the second part of the packed matrix.

This is where the new tied intrinsic comes into play: it guarantees that the "untouched" half of the value is preserved, by tying input and output to the same physical VGPRs.
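As a sketch of what this looks like at the llvm.amdgcn level (the .tied name follows this patch series; treat the exact declaration as illustrative):

declare <16 x half> @llvm.amdgcn.wmma.f16.16x16x16.f16.tied(<16 x half>, <16 x half>, <16 x half>, i1 immarg)

; The call only updates the half selected by the i1 flag; the other half of the
; accumulator survives because source and destination are tied to the same VGPRs.
%lo = call <16 x half> @llvm.amdgcn.wmma.f16.16x16x16.f16.tied(<16 x half> %a0, <16 x half> %b0, <16 x half> %acc, i1 false)
%hi = call <16 x half> @llvm.amdgcn.wmma.f16.16x16x16.f16.tied(<16 x half> %a1, <16 x half> %b1, <16 x half> %lo, i1 true)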

Thanks for the extra info; I understand the problem now: currently there seems to be no way to take advantage of the opsel bit to reuse the same destination matrix registers for two wmma instructions.

One way to fix that would be as proposed here, at the intrinsic level. In this approach the intrinsic would always take the full register (C, D matrices), operate on the specified half, and preserve the other half.

But maybe this can be fixed in codegen without adding the new intrinsic? I think the underlying problem is that we are not correctly modelling the fact that these op_sel instructions do not touch the other half. If we could do that, I would expect the two-address pass to reuse the destination registers of the first instruction, because the low halves would still be live at the end of the second instruction.

%combined = call <8 x float> @combine_halfs(<8 x float> %lo_16_bit_matrix, <8 x float> %hi_16_bit_matrix)
%low_half_multiplied = call <8 x float> @wmma(%a, %b, <8 x float> %combined, i1 false /* low half */)
%high_half_multiplied = call <8 x float> @wmma(%a, %b, <8 x float> %combined, i1 true /* high half */)

That's true @piotr, that is the underlying issue. The two-address pass needs to somehow notice this behavior. It makes sense to tackle this problem at the root, so I will look into the current decision process in the pass and think about solutions.

I am open to any suggestions. Has there been a similar case before?

> Thanks for the extra info; I understand the problem now: currently there seems to be no way to take advantage of the opsel bit to reuse the same destination matrix registers for two wmma instructions.
>
> One way to fix that would be as proposed here, at the intrinsic level. In this approach the intrinsic would always take the full register (C, D matrices), operate on the specified half, and preserve the other half.
>
> But maybe this can be fixed in codegen without adding the new intrinsic? I think the underlying problem is that we are not correctly modelling the fact that these op_sel instructions do not touch the other half. If we could do that, I would expect the two-address pass to reuse the destination registers of the first instruction, because the low halves would still be live at the end of the second instruction.

So, if you have multiple packed operations in a row, you would need to "re-combine" the matrices? I.e.:

%combined = call <8 x float> @combine_halfs(<8 x float> %lo_16_bit_matrix, <8 x float> %hi_16_bit_matrix)
%low_half_multiplied = call <8 x float> @wmma(%a, %b, <8 x float> %combined, i1 false /* low half */)
%high_half_multiplied = call <8 x float> @wmma(%a, %b, <8 x float> %combined, i1 true /* high half */)
%combined2 = call <8 x float> @combine_halfs(<8 x float> %low_half_multiplied, <8 x float> %high_half_multiplied)
%low_half_multiplied2 = call <8 x float> @wmma(%a, %b, <8 x float> %combined2, i1 false /* low half */)
%high_half_multiplied2 = call <8 x float> @wmma(%a, %b, <8 x float> %combined2, i1 true /* high half */)

Then, as you say, our register allocation needs to be intelligent enough to keep the matrices packed.
How would you define the instructions for this to work?

piotr added a comment. Sep 18 2023, 7:53 AM

> Then, as you say, our register allocation needs to be intelligent enough to keep the matrices packed.
> How would you define the instructions for this to work?

Unfortunately, looking at it a bit more, I don't think the scheme I proposed is feasible. Even if we add some extra copies to preserve the other half, the TwoAddressInstruction pass will not be able to understand that.

The only alternative I could suggest, instead of adding new intrinsics, would be to implement the packing entirely in codegen (e.g., after the TwoAddressInstruction pass).

> > Then, as you say, our register allocation needs to be intelligent enough to keep the matrices packed.
> > How would you define the instructions for this to work?
>
> Unfortunately, looking at it a bit more, I don't think the scheme I proposed is feasible. Even if we add some extra copies to preserve the other half, the TwoAddressInstruction pass will not be able to understand that.
>
> The only alternative I could suggest, instead of adding new intrinsics, would be to implement the packing entirely in codegen (e.g., after the TwoAddressInstruction pass).

Thank you for looking into it! I don't know if the packing is feasible at that stage, though. The packing affects the users of the matrices as well, and we base our solution heavily on finding the users of the lgc intrinsics. Plus, the changes can remove some intrinsic calls entirely, which reduces code size for every later pass. Delaying the changes to such a late stage would make the packing process more complicated, and we would carry code through each pass that might be removed later anyway.