Details
- Reviewers
arsenm tpr - Commits
- rG96e51ed005a9: [AMDGPU] Implement copyPhysReg for 16 bit subregs
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
I don't think copies of these should ever be produced (at leasts for the high half) since the high half is not really addressable, and only appears that way to some instructions. Where are copies coming from?
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp | ||
---|---|---|
701 | V_PACK_B32_F16 has some FP flushing properties and is not suitable for a copy. I think you have to do essentially what D74740 does |
First, hi16 registers are used by load_hi instructions, that is their destination. And then RA can happily copy anything to anything. For sanity we need to know how to copy any register.
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp | ||
---|---|---|
701 | I cannot do it here, I would need to scavenge a physreg for a mask, either if I use v_perm_b32 (if available) or v_bfi_b32... In fact I do not see a good instruction to do it if v_pack_b32 does not work. |
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp | ||
---|---|---|
701 | Yes, there are definitely missing instructions to handle this well. I think you can use V_ALIGNBIT_B32 without an extra register in a subset of cases |
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp | ||
---|---|---|
701 | It does not work for the most needed thing: copy low to low. Well, in fact it does not help at all. |
The high result isn't what's encoded though, so they really are writing the 32-bit register. They only read the low 16-bits. I think the correct way to model this is a 32-bit write but only a 16-bit read
I think declaring the high 16 is the output register is still wrong and not how it's encoded. Having only the 16-bit read is still an improvement
The testcase I showed in a parent revision only works if I define both, and for a reason: only so we can model independent subreg access.
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp | ||
---|---|---|
701 | Another thing which doesn't work is mov with sdwa. It needs dst_preserve and then it needs tied operand for this. If pack doesn't work I can think only about an extremely ugly solution in a general case: clear destination bits with two shifts, then use v_or_b16 with op_sel. |
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp | ||
---|---|---|
701 | Oh, wait... We do not have v_or_b16. Then v_add_u16 with op_sel, afair it will preserve the other half of the destination. |
Added braces.
llvm/test/CodeGen/AMDGPU/lo16-hi16-physreg-copy.mir | ||
---|---|---|
106 | Good catch about the same VGPR. I will need some special handling for it. |
Added handling of DestReg == SrcReg.
Added test for killed/undef/full reg partial copy.
V_PACK_B32_F16 has some FP flushing properties and is not suitable for a copy. I think you have to do essentially what D74740 does