This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Implement copyPhysReg for 16 bit subregs
ClosedPublic

Authored by rampitec on Feb 20 2020, 4:51 PM.

Diff Detail

Event Timeline

rampitec created this revision.Feb 20 2020, 4:51 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 20 2020, 4:51 PM

I don't think copies of these should ever be produced (at leasts for the high half) since the high half is not really addressable, and only appears that way to some instructions. Where are copies coming from?

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
701

V_PACK_B32_F16 has some FP flushing properties and is not suitable for a copy. I think you have to do essentially what D74740 does

rampitec marked an inline comment as done.Feb 20 2020, 5:10 PM

I don't think copies of these should ever be produced (at leasts for the high half) since the high half is not really addressable, and only appears that way to some instructions. Where are copies coming from?

First, hi16 registers are used by load_hi instructions, that is their destination. And then RA can happily copy anything to anything. For sanity we need to know how to copy any register.

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
701

I cannot do it here, I would need to scavenge a physreg for a mask, either if I use v_perm_b32 (if available) or v_bfi_b32... In fact I do not see a good instruction to do it if v_pack_b32 does not work.

arsenm added inline comments.Feb 20 2020, 5:16 PM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
701

Yes, there are definitely missing instructions to handle this well. I think you can use V_ALIGNBIT_B32 without an extra register in a subset of cases

rampitec marked an inline comment as done.Feb 20 2020, 5:22 PM
rampitec added inline comments.
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
701

It does not work for the most needed thing: copy low to low. Well, in fact it does not help at all.

I don't think copies of these should ever be produced (at leasts for the high half) since the high half is not really addressable, and only appears that way to some instructions. Where are copies coming from?

First, hi16 registers are used by load_hi instructions, that is their destination. And then RA can happily copy anything to anything. For sanity we need to know how to copy any register.

The high result isn't what's encoded though, so they really are writing the 32-bit register. They only read the low 16-bits. I think the correct way to model this is a 32-bit write but only a 16-bit read

I don't think copies of these should ever be produced (at leasts for the high half) since the high half is not really addressable, and only appears that way to some instructions. Where are copies coming from?

First, hi16 registers are used by load_hi instructions, that is their destination. And then RA can happily copy anything to anything. For sanity we need to know how to copy any register.

The high result isn't what's encoded though, so they really are writing the 32-bit register. They only read the low 16-bits. I think the correct way to model this is a 32-bit write but only a 16-bit read

Low16 are preserved and if we say we write 32 bit then we cannot model it.

I don't think copies of these should ever be produced (at leasts for the high half) since the high half is not really addressable, and only appears that way to some instructions. Where are copies coming from?

First, hi16 registers are used by load_hi instructions, that is their destination. And then RA can happily copy anything to anything. For sanity we need to know how to copy any register.

The high result isn't what's encoded though, so they really are writing the 32-bit register. They only read the low 16-bits. I think the correct way to model this is a 32-bit write but only a 16-bit read

Low16 are preserved and if we say we write 32 bit then we cannot model it.

I think declaring the high 16 is the output register is still wrong and not how it's encoded. Having only the 16-bit read is still an improvement

I don't think copies of these should ever be produced (at leasts for the high half) since the high half is not really addressable, and only appears that way to some instructions. Where are copies coming from?

First, hi16 registers are used by load_hi instructions, that is their destination. And then RA can happily copy anything to anything. For sanity we need to know how to copy any register.

The high result isn't what's encoded though, so they really are writing the 32-bit register. They only read the low 16-bits. I think the correct way to model this is a 32-bit write but only a 16-bit read

Low16 are preserved and if we say we write 32 bit then we cannot model it.

I think declaring the high 16 is the output register is still wrong and not how it's encoded. Having only the 16-bit read is still an improvement

The testcase I showed in a parent revision only works if I define both, and for a reason: only so we can model independent subreg access.

rampitec marked an inline comment as done.Feb 20 2020, 5:47 PM
rampitec added inline comments.
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
701

Another thing which doesn't work is mov with sdwa. It needs dst_preserve and then it needs tied operand for this.

If pack doesn't work I can think only about an extremely ugly solution in a general case: clear destination bits with two shifts, then use v_or_b16 with op_sel.

rampitec marked an inline comment as done.Feb 20 2020, 5:55 PM
rampitec added inline comments.
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
701

Oh, wait... We do not have v_or_b16. Then v_add_u16 with op_sel, afair it will preserve the other half of the destination.

rampitec updated this revision to Diff 247139.Feb 27 2020, 5:07 PM

Changed expansion to pairs of binary instructions.
Verified expansion with RT test.

arsenm added inline comments.Feb 28 2020, 7:35 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
717

Braecss

723

Braecs

llvm/test/CodeGen/AMDGPU/lo16-hi16-physreg-copy.mir
106

Needs some tests with both halves in the same 32-bit register

Also need some with kill and undef flag handling

rampitec updated this revision to Diff 247337.Feb 28 2020, 12:10 PM
rampitec marked 3 inline comments as done.

Added braces.

llvm/test/CodeGen/AMDGPU/lo16-hi16-physreg-copy.mir
106

Good catch about the same VGPR. I will need some special handling for it.

rampitec planned changes to this revision.Feb 28 2020, 12:23 PM
rampitec updated this revision to Diff 247381.Feb 28 2020, 3:44 PM
rampitec marked an inline comment as done.

Added handling of DestReg == SrcReg.
Added test for killed/undef/full reg partial copy.

rampitec updated this revision to Diff 247382.Feb 28 2020, 3:46 PM

Uploaded full context diff.

rampitec added a reviewer: tpr.Apr 7 2020, 1:14 PM
arsenm accepted this revision.Apr 7 2020, 2:05 PM
This revision is now accepted and ready to land.Apr 7 2020, 2:05 PM
This revision was automatically updated to reflect the committed changes.