This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Reduce AGPR to AGPR copies with same source
AbandonedPublic

Authored by vangthao on Jun 28 2021, 1:57 PM.

Details

Summary

On targets where direct AGPR to AGPR copy is not possible, the postrapseudos
pass tries to expand AGPR to AGPR copies by first trying to find the defining
accvgpr_write instruction and propagating its vgpr register to avoid creating
temporary registers. In cases where we cannot propagate, a new temporary VGPR
register will be created for each AGPR copy instruction even if the sources
are the same for all or some of the copy instructions.

With this patch, after failing to propagate the defining instruction's VGPR
register, we will next try to see if another copy instruction has already
inserted in a temporay VGPR copy with accvgpr_read and attempt to propagate
that copy.

Diff Detail

Event Timeline

vangthao created this revision.Jun 28 2021, 1:57 PM
vangthao requested review of this revision.Jun 28 2021, 1:57 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 28 2021, 1:57 PM
arsenm requested changes to this revision.Jun 28 2021, 5:18 PM

I think copyPhysReg should not have to do any nontrivial scanning. These situations where this matters should be resolvable before register allocation

This revision now requires changes to proceed.Jun 28 2021, 5:18 PM

I think copyPhysReg should not have to do any nontrivial scanning. These situations where this matters should be resolvable before register allocation

I concur. This is untrivial and error prone. There are quite a number of foldings for agpr copy and init in different places, this looks like a candidate for yet another one, starting from a relevant testcase. This place is really a last resort.

Do you have a minimal ir test showing a problem you are solving?

I think copyPhysReg should not have to do any nontrivial scanning. These situations where this matters should be resolvable before register allocation

I concur. This is untrivial and error prone. There are quite a number of foldings for agpr copy and init in different places, this looks like a candidate for yet another one, starting from a relevant testcase. This place is really a last resort.

Do you have a minimal ir test showing a problem you are solving?

I added a test case broadcast_a32_unable_to_propagate_v0 in accvgpr-copy.mir where we failed to propagate v0 when broadcasting $agpr0 to 31 other lanes. Using the trunk, this forces us to read $agpr0 31 times, write it to a vgpr register, and then copy it to the agpr lane. This issue occurs when an instruction is scheduled in-between the defining accvgpr_write and the COPIES and also writes over the same vgpr register as the defining accvgpr_write. In this case, a V_MOV_B32_e32 uses $vgpr0 and thus we cannot propagate $vgpr0 to the AGPR copies.

I believe this is really only an issue when we have a situation where multiple lanes are using the same source. Following the IR printed by LLVM before register allocation, the code looks somewhat like this

%80:vgpr_32 = COPY %12:sgpr_32
%81:agpr_32 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
%82:agpr_32 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
%83:agpr_32 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
%84:agpr_32 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
%85:agpr_32 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
...
%70:areg_1024 = REG_SEQUENCE %81:agpr_32, %subreg.sub0, %82:agpr_32 ...

This seems fine to me. We are using the vgpr register instead of an agpr register before register allocation. But then the machine-cse pass eliminates this chain of instruction and creates a dependency on agpr register %81.

%80:vgpr_32 = COPY %12:sgpr_32
%81:agpr_32 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
%70:areg_1024 = REG_SEQUENCE %81:agpr_32, %subreg.sub0, %81:agpr_32, %subreg.sub1, %81:agpr_32 ...

This is an issue because the REG_SEQUENCE will be expanded into multiple AGPR copies during twoaddressinstruction pass.

%80:vgpr_32 = V_MOV_B32_e32 1123418112, implicit $exec
%81:agpr_32 = V_ACCVGPR_WRITE_B32_e64 killed %80:vgpr_32, implicit $exec
undef %70.sub0:areg_1024 = COPY %81:agpr_32
%70.sub1:areg_1024 = COPY %81:agpr_32
%70.sub2:areg_1024 = COPY %81:agpr_32
%70.sub3:areg_1024 = COPY %81:agpr_32
...

So we just opted to eliminate instructions and replaced them with AGPR copies that we will have to transform back into those same instructions we eliminated. Even worse if we happen to not be able to propagate %80:vgpr_32 as in the test scenario broadcast_a32_unable_to_propagate_v0.

Disabling machine-cse gives us something like this:

64B       %80:vgpr_32 = V_MOV_B32_e32 1123418112, implicit $exec
80B       %81:agpr_32 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
96B       %82:agpr_32 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
112B      %83:agpr_32 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
128B      %84:agpr_32 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
144B      %85:agpr_32 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
...
608B      %70.sub1:areg_1024 = COPY %82:agpr_32
624B      %70.sub2:areg_1024 = COPY %83:agpr_32
640B      %70.sub3:areg_1024 = COPY %84:agpr_32
656B      %70.sub4:areg_1024 = COPY %85:agpr_32
672B      %70.sub5:areg_1024 = COPY %86:agpr_32
...

Which then the AGPR copies is folded into the accvgpr_write by the simple register coalescing pass.

64B       %80:vgpr_32 = V_MOV_B32_e32 1123418112, implicit $exec
80B       undef %70.sub0:areg_1024 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
96B       %70.sub1:areg_1024 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
112B      %70.sub2:areg_1024 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
128B      %70.sub3:areg_1024 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
144B      %70.sub4:areg_1024 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
...

This seems to be perfectly fine and is what we wanted since there are no AGPR copies and the accvgpr_writes are using vgpr sources. So I believe the issue here is that the machine-cse pass is removing something that we wish to keep and replacing it with something possibly worse. Eliminating accvgpr_write and replacing them with AGPR copies does not seem to be equivalent or better on targets that are not able to copy from AGPR to AGPR directly.

This is an issue because the REG_SEQUENCE will be expanded into multiple AGPR copies during twoaddressinstruction pass.

%80:vgpr_32 = V_MOV_B32_e32 1123418112, implicit $exec
%81:agpr_32 = V_ACCVGPR_WRITE_B32_e64 killed %80:vgpr_32, implicit $exec
undef %70.sub0:areg_1024 = COPY %81:agpr_32
%70.sub1:areg_1024 = COPY %81:agpr_32
%70.sub2:areg_1024 = COPY %81:agpr_32
%70.sub3:areg_1024 = COPY %81:agpr_32
...

Can we somehow catch it at this point? This seems to be a culprit.

This is an issue because the REG_SEQUENCE will be expanded into multiple AGPR copies during twoaddressinstruction pass.

%80:vgpr_32 = V_MOV_B32_e32 1123418112, implicit $exec
%81:agpr_32 = V_ACCVGPR_WRITE_B32_e64 killed %80:vgpr_32, implicit $exec
undef %70.sub0:areg_1024 = COPY %81:agpr_32
%70.sub1:areg_1024 = COPY %81:agpr_32
%70.sub2:areg_1024 = COPY %81:agpr_32
%70.sub3:areg_1024 = COPY %81:agpr_32
...

Can we somehow catch it at this point? This seems to be a culprit.

I agree this does seem to be the culprit so we would need to catch it between this pass and register allocation. However, I'm not sure which pass in-between those two should have the responsibility of catching this or whether there should be a new pass to handle this.

rampitec added a comment.EditedJun 30 2021, 10:23 AM
%80:vgpr_32 = COPY %12:sgpr_32
%81:agpr_32 = V_ACCVGPR_WRITE_B32_e64 %80:vgpr_32, implicit $exec
%70:areg_1024 = REG_SEQUENCE %81:agpr_32, %subreg.sub0, %81:agpr_32, %subreg.sub1, %81:agpr_32 ...

I agree this does seem to be the culprit so we would need to catch it between this pass and register allocation. However, I'm not sure which pass in-between those two should have the responsibility of catching this or whether there should be a new pass to handle this.

It looks like a special case where we splat an sgpr into an areg... What will happen if we catch it in FixSGPRCopies and keep reg_sequence initializing areg directly from the sgpr?

The other possibility is to try catching it before two address pass in the SIFoldOperands and attempt to fold vgpr into reg_sequence. It should possible to play with mir tests to see how it will work.

rampitec added inline comments.Jun 30 2021, 10:28 AM
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.ll
581

In any way this comment will need to be removed and function renamed if the problem is resolved.

vangthao abandoned this revision.Sep 23 2021, 3:21 PM