This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Make SREG_LO16 legal
AbandonedPublic

Authored by rampitec on Apr 24 2020, 1:48 PM.

Details

Reviewers
arsenm
vpykhtin
Summary

This includes a hack to fix illegal 32 to 16 bit copies.
The problem is when we make 16 bit subregs legal it creates
a huge amount of failures which can only be resolved at once
without a temporary hack like this.

The next step is to change operands, instruction definitions
and patterns until this hack is not needed.

There are also quite obvious regressions as seen in the tests,
but surprisingly not as huge as I expected.

I am not really sure it needs to be pushed right away, but
I definitely want to review it now and agree on the general
direction.

Diff Detail

Event Timeline

rampitec created this revision.Apr 24 2020, 1:48 PM
asbirlea removed a subscriber: asbirlea.Apr 24 2020, 2:35 PM

Could we leave all instructions as 32-bit defs, but then have a 16-bit subreg copy as the only use?

%0:vgpr_32 = V_FOO_U16
%1:vgpr_16lo = COPY %0.sub16_lo
....

The register allocator would understand this and fold out the copy

llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp
315

This is a red flag, this should never happen

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
544–547

We should make it so that MachineCopyPropagation can deal with this case

545

Why can't you just erase the instruction?

rampitec marked 3 inline comments as done.Apr 24 2020, 3:59 PM

Could we leave all instructions as 32-bit defs, but then have a 16-bit subreg copy as the only use?

%0:vgpr_32 = V_FOO_U16
%1:vgpr_16lo = COPY %0.sub16_lo
....

The register allocator would understand this and fold out the copy

So you want to keep SReg_32 as an RC for i16 and f16? And add a copy to each instruction producing a 16 bit value?

I am afraid it will not work. Instruction needs to produce i16 for which legal class is SReg_32, we will return SReg_lo16. Selector will either complain, not match or emit yet another copy. The problem is a 16 bit value needs to leave either in a 32 bit or a 16 bit RC, but not both.

llvm/lib/Target/AMDGPU/AMDGPUMCInstLower.cpp
315

This is in place of redundant mov v0, v0 when one side is 16 bit and another 32. When all is done it should go away.

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
544–547

Hm... So you want it always? I thought this is a red flag.

545

Because caller decrements the iterator expecting new instruction to be there. It then transfers implicit operands there. This is also a hack and frankly breaks if we emit more than one instruction. Maybe we need to emit a bundle instead of a sequence.

This is also why DELETED is created.

Could we leave all instructions as 32-bit defs, but then have a 16-bit subreg copy as the only use?

%0:vgpr_32 = V_FOO_U16
%1:vgpr_16lo = COPY %0.sub16_lo
....

The register allocator would understand this and fold out the copy

So you want to keep SReg_32 as an RC for i16 and f16? And add a copy to each instruction producing a 16 bit value?

I am afraid it will not work. Instruction needs to produce i16 for which legal class is SReg_32, we will return SReg_lo16. Selector will either complain, not match or emit yet another copy. The problem is a 16 bit value needs to leave either in a 32 bit or a 16 bit RC, but not both.

This would probably require new support in the InstrEmitter or patterns but I don't think is impossible

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
544–547

I mean MCP should eliminate these identity copies. It shouldn't see the src == dest due to 16-bit obscuring this (at least for > -O0)

Could we leave all instructions as 32-bit defs, but then have a 16-bit subreg copy as the only use?

%0:vgpr_32 = V_FOO_U16
%1:vgpr_16lo = COPY %0.sub16_lo
....

The register allocator would understand this and fold out the copy

So you want to keep SReg_32 as an RC for i16 and f16? And add a copy to each instruction producing a 16 bit value?

I am afraid it will not work. Instruction needs to produce i16 for which legal class is SReg_32, we will return SReg_lo16. Selector will either complain, not match or emit yet another copy. The problem is a 16 bit value needs to leave either in a 32 bit or a 16 bit RC, but not both.

This would probably require new support in the InstrEmitter or patterns but I don't think is impossible

This is what it really needs to look like, otherwise it's lying about producing a 16-bit result. If we pretend it really only writes 16-bits, we can't make use of the different high bit handling behaviors

Could we leave all instructions as 32-bit defs, but then have a 16-bit subreg copy as the only use?

%0:vgpr_32 = V_FOO_U16
%1:vgpr_16lo = COPY %0.sub16_lo
....

The register allocator would understand this and fold out the copy

So you want to keep SReg_32 as an RC for i16 and f16? And add a copy to each instruction producing a 16 bit value?

I am afraid it will not work. Instruction needs to produce i16 for which legal class is SReg_32, we will return SReg_lo16. Selector will either complain, not match or emit yet another copy. The problem is a 16 bit value needs to leave either in a 32 bit or a 16 bit RC, but not both.

This would probably require new support in the InstrEmitter or patterns but I don't think is impossible

This is what it really needs to look like, otherwise it's lying about producing a 16-bit result. If we pretend it really only writes 16-bits, we can't make use of the different high bit handling behaviors

Hm... That's true. Source operands must be 16 bit and destination 32.

rampitec marked an inline comment as done.Apr 27 2020, 5:00 PM
rampitec added inline comments.
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
544–547

Could it? Assume:

$vgpr1 = $vgpr2_lo16
%vgpr1 = %vgpr2.lo16

It does not tell anything about the contents of a high half after the copy. Basically semantics of these is unclear. We can argue a reverse situation is a truncation, but really undefined as well because it would be unrepresentable.

Herald added a project: Restricted Project. · View Herald TranscriptApr 27 2020, 5:00 PM
rampitec updated this revision to Diff 260959.Apr 29 2020, 11:06 AM
  • Rebased.
  • Dropped DELETED node, just return an empty BUNDLE.

I still do not believe we need to make SReg_LO16 legal just yet, but maybe we need the rest.

arsenm added inline comments.Apr 29 2020, 11:43 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
544–547

I'm not sure I exactly mean MCP. I'm sort of thinking the allocator would have turned the 16-bit subreg into the same 32-bit reg

arsenm added inline comments.Apr 29 2020, 11:45 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
151–152

These are still here, although contrary to your last comment that we don't need to make these legal yet? (which I interpreted as using as a legal type reg class)

rampitec marked an inline comment as done.Apr 29 2020, 11:53 AM
rampitec added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
151–152

Yes, this is still here just for discussion purposes. I think I will drop this part and revert tests as well, but add a mir test for copyPhysReg() instead.

rampitec marked an inline comment as done.Apr 29 2020, 1:09 PM
rampitec added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
151–152

CopyPhysReg is extracted into D79119.

rampitec abandoned this revision.May 4 2020, 8:41 AM