HomePhabricator

[amdgpu] Lower SGPR-to-VGPR copy in the final phase of ISel.
Concern Raisedc3492a1aa1b9

Authored by hliao on Sep 9 2020, 1:48 PM.

Description

[amdgpu] Lower SGPR-to-VGPR copy in the final phase of ISel.

  • Need to lower COPY from SGPR to VGPR to a real instruction as the standard COPY is used where the source and destination are from the same register bank so that we potentially coalesc them together and save one COPY. Considering that, backend optimizations, such as CSE, won't handle them. However, the copy from SGPR to VGPR always needs materializing to a native instruction, it should be lowered into a real one before other backend optimizations.

Differential Revision: https://reviews.llvm.org/D87556

Event Timeline

arsenm raised a concern with this commit.Sep 17 2020, 11:42 AM
arsenm added a subscriber: arsenm.

Can you revert this? I think that this transform itself is a workaround, and even if it were a good idea, I think it doesn't belong in another loop over the function in finalizeLowering

This commit now has outstanding concerns.Sep 17 2020, 11:42 AM

Can you revert this? I think that this transform itself is a workaround, and even if it were a good idea, I think it doesn't belong in another loop over the function in finalizeLowering

Could you elaborate on why that would be a workaround? Basically, after instruction selection, the COPY from SGPR to VGPR should be lowered to a native instruction.

Can you revert this? I think that this transform itself is a workaround, and even if it were a good idea, I think it doesn't belong in another loop over the function in finalizeLowering

Could you elaborate on why that would be a workaround? Basically, after instruction selection, the COPY from SGPR to VGPR should be lowered to a native instruction.

Because this should be done after register allocation like is already done. Replacing a copy with something else should only interfere with generic optimizations

Can you revert this? I think that this transform itself is a workaround, and even if it were a good idea, I think it doesn't belong in another loop over the function in finalizeLowering

Could you elaborate on why that would be a workaround? Basically, after instruction selection, the COPY from SGPR to VGPR should be lowered to a native instruction.

Because this should be done after register allocation like is already done. Replacing a copy with something else should only interfere with generic optimizations

That would be too late for MachineCSE and other optimization to remove the redundant COPYs and reduce the register usage. Moving that after RA won't reduce register pressure.

Can you revert this? I think that this transform itself is a workaround, and even if it were a good idea, I think it doesn't belong in another loop over the function in finalizeLowering

Could you elaborate on why that would be a workaround? Basically, after instruction selection, the COPY from SGPR to VGPR should be lowered to a native instruction.

Because this should be done after register allocation like is already done. Replacing a copy with something else should only interfere with generic optimizations

That would be too late for MachineCSE and other optimization to remove the redundant COPYs and reduce the register usage. Moving that after RA won't reduce register pressure.

So you are working around something MahcineCSE isn't doing on copies. You could just have MachineCSE do this for these copies

hliao added a comment.Sep 17 2020, 2:32 PM

Can you revert this? I think that this transform itself is a workaround, and even if it were a good idea, I think it doesn't belong in another loop over the function in finalizeLowering

Could you elaborate on why that would be a workaround? Basically, after instruction selection, the COPY from SGPR to VGPR should be lowered to a native instruction.

Because this should be done after register allocation like is already done. Replacing a copy with something else should only interfere with generic optimizations

That would be too late for MachineCSE and other optimization to remove the redundant COPYs and reduce the register usage. Moving that after RA won't reduce register pressure.

So you are working around something MahcineCSE isn't doing on copies. You could just have MachineCSE do this for these copies

MachineCSE and other optimizations are designed not to handle that target-independent COPY or like, which is added with the intention that, potentially, the source and destination operands are coalesced and that COPY is removed finally. As SGPR and VGPR are different register banks and won't be coalesced anyway, native instruction should be used instead.

Can you revert this? I think that this transform itself is a workaround, and even if it were a good idea, I think it doesn't belong in another loop over the function in finalizeLowering

Could you elaborate on why that would be a workaround? Basically, after instruction selection, the COPY from SGPR to VGPR should be lowered to a native instruction.

Because this should be done after register allocation like is already done. Replacing a copy with something else should only interfere with generic optimizations

That would be too late for MachineCSE and other optimization to remove the redundant COPYs and reduce the register usage. Moving that after RA won't reduce register pressure.

So you are working around something MahcineCSE isn't doing on copies. You could just have MachineCSE do this for these copies

MachineCSE and other optimizations are designed not to handle that target-independent COPY or like, which is added with the intention that, potentially, the source and destination operands are coalesced and that COPY is removed finally. As SGPR and VGPR are different register banks and won't be coalesced anyway, native instruction should be used instead.

This does not make sense, COPY is what generic optimizations do understand. If it's useful to CSE cross bank copies, MachineCSE should handle them. As a representational choice, not-copy is worse than copy

Can you revert this? I think that this transform itself is a workaround, and even if it were a good idea, I think it doesn't belong in another loop over the function in finalizeLowering

Could you elaborate on why that would be a workaround? Basically, after instruction selection, the COPY from SGPR to VGPR should be lowered to a native instruction.

Because this should be done after register allocation like is already done. Replacing a copy with something else should only interfere with generic optimizations

That would be too late for MachineCSE and other optimization to remove the redundant COPYs and reduce the register usage. Moving that after RA won't reduce register pressure.

So you are working around something MahcineCSE isn't doing on copies. You could just have MachineCSE do this for these copies

MachineCSE and other optimizations are designed not to handle that target-independent COPY or like, which is added with the intention that, potentially, the source and destination operands are coalesced and that COPY is removed finally. As SGPR and VGPR are different register banks and won't be coalesced anyway, native instruction should be used instead.

This does not make sense, COPY is what generic optimizations do understand. If it's useful to CSE cross bank copies, MachineCSE should handle them. As a representational choice, not-copy is worse than copy

I ran into this before and apparently peephole optimizer should handle this: https://groups.google.com/g/llvm-dev/c/a4jKBqCJIDM/m/BRu3sWopBAAJ

hliao added a comment.Sep 18 2020, 8:23 AM

Can you revert this? I think that this transform itself is a workaround, and even if it were a good idea, I think it doesn't belong in another loop over the function in finalizeLowering

Could you elaborate on why that would be a workaround? Basically, after instruction selection, the COPY from SGPR to VGPR should be lowered to a native instruction.

Because this should be done after register allocation like is already done. Replacing a copy with something else should only interfere with generic optimizations

That would be too late for MachineCSE and other optimization to remove the redundant COPYs and reduce the register usage. Moving that after RA won't reduce register pressure.

So you are working around something MahcineCSE isn't doing on copies. You could just have MachineCSE do this for these copies

MachineCSE and other optimizations are designed not to handle that target-independent COPY or like, which is added with the intention that, potentially, the source and destination operands are coalesced and that COPY is removed finally. As SGPR and VGPR are different register banks and won't be coalesced anyway, native instruction should be used instead.

This does not make sense, COPY is what generic optimizations do understand. If it's useful to CSE cross bank copies, MachineCSE should handle them. As a representational choice, not-copy is worse than copy

A target-independent cross-bank COPY is definitely useful but the current COPY should not be used for that purpose considering how it's used in RA-related passes, especially only architectural constraints are changed between the source and destination operands where the propagation should be stopped.

Can you revert this? I think that this transform itself is a workaround, and even if it were a good idea, I think it doesn't belong in another loop over the function in finalizeLowering

Could you elaborate on why that would be a workaround? Basically, after instruction selection, the COPY from SGPR to VGPR should be lowered to a native instruction.

Because this should be done after register allocation like is already done. Replacing a copy with something else should only interfere with generic optimizations

That would be too late for MachineCSE and other optimization to remove the redundant COPYs and reduce the register usage. Moving that after RA won't reduce register pressure.

So you are working around something MahcineCSE isn't doing on copies. You could just have MachineCSE do this for these copies

MachineCSE and other optimizations are designed not to handle that target-independent COPY or like, which is added with the intention that, potentially, the source and destination operands are coalesced and that COPY is removed finally. As SGPR and VGPR are different register banks and won't be coalesced anyway, native instruction should be used instead.

This does not make sense, COPY is what generic optimizations do understand. If it's useful to CSE cross bank copies, MachineCSE should handle them. As a representational choice, not-copy is worse than copy

I ran into this before and apparently peephole optimizer should handle this: https://groups.google.com/g/llvm-dev/c/a4jKBqCJIDM/m/BRu3sWopBAAJ

We should a generic

Can you revert this? I think that this transform itself is a workaround, and even if it were a good idea, I think it doesn't belong in another loop over the function in finalizeLowering

Could you elaborate on why that would be a workaround? Basically, after instruction selection, the COPY from SGPR to VGPR should be lowered to a native instruction.

Because this should be done after register allocation like is already done. Replacing a copy with something else should only interfere with generic optimizations

That would be too late for MachineCSE and other optimization to remove the redundant COPYs and reduce the register usage. Moving that after RA won't reduce register pressure.

So you are working around something MahcineCSE isn't doing on copies. You could just have MachineCSE do this for these copies

MachineCSE and other optimizations are designed not to handle that target-independent COPY or like, which is added with the intention that, potentially, the source and destination operands are coalesced and that COPY is removed finally. As SGPR and VGPR are different register banks and won't be coalesced anyway, native instruction should be used instead.

This does not make sense, COPY is what generic optimizations do understand. If it's useful to CSE cross bank copies, MachineCSE should handle them. As a representational choice, not-copy is worse than copy

I ran into this before and apparently peephole optimizer should handle this: https://groups.google.com/g/llvm-dev/c/a4jKBqCJIDM/m/BRu3sWopBAAJ

Hi Matt

Can you revert this? I think that this transform itself is a workaround, and even if it were a good idea, I think it doesn't belong in another loop over the function in finalizeLowering

Could you elaborate on why that would be a workaround? Basically, after instruction selection, the COPY from SGPR to VGPR should be lowered to a native instruction.

Because this should be done after register allocation like is already done. Replacing a copy with something else should only interfere with generic optimizations

That would be too late for MachineCSE and other optimization to remove the redundant COPYs and reduce the register usage. Moving that after RA won't reduce register pressure.

So you are working around something MahcineCSE isn't doing on copies. You could just have MachineCSE do this for these copies

MachineCSE and other optimizations are designed not to handle that target-independent COPY or like, which is added with the intention that, potentially, the source and destination operands are coalesced and that COPY is removed finally. As SGPR and VGPR are different register banks and won't be coalesced anyway, native instruction should be used instead.

This does not make sense, COPY is what generic optimizations do understand. If it's useful to CSE cross bank copies, MachineCSE should handle them. As a representational choice, not-copy is worse than copy

I ran into this before and apparently peephole optimizer should handle this: https://groups.google.com/g/llvm-dev/c/a4jKBqCJIDM/m/BRu3sWopBAAJ

But, peephole opt has fundamental limits on non-local code motion, such as LICM, which is important to reduce the code strength within a loop. For those opts, we don't have redundant copies but need a better place to hold them.

Can you revert this? I think that this transform itself is a workaround, and even if it were a good idea, I think it doesn't belong in another loop over the function in finalizeLowering

Could you elaborate on why that would be a workaround? Basically, after instruction selection, the COPY from SGPR to VGPR should be lowered to a native instruction.

Because this should be done after register allocation like is already done. Replacing a copy with something else should only interfere with generic optimizations

That would be too late for MachineCSE and other optimization to remove the redundant COPYs and reduce the register usage. Moving that after RA won't reduce register pressure.

So you are working around something MahcineCSE isn't doing on copies. You could just have MachineCSE do this for these copies

MachineCSE and other optimizations are designed not to handle that target-independent COPY or like, which is added with the intention that, potentially, the source and destination operands are coalesced and that COPY is removed finally. As SGPR and VGPR are different register banks and won't be coalesced anyway, native instruction should be used instead.

This does not make sense, COPY is what generic optimizations do understand. If it's useful to CSE cross bank copies, MachineCSE should handle them. As a representational choice, not-copy is worse than copy

A target-independent cross-bank COPY is definitely useful but the current COPY should not be used for that purpose considering how it's used in RA-related passes, especially only architectural constraints are changed between the source and destination operands where the propagation should be stopped.

This is exactly what copy is for. It wouldn't make sense to have a separate copy for this

Can you revert this? I think that this transform itself is a workaround, and even if it were a good idea, I think it doesn't belong in another loop over the function in finalizeLowering

Could you elaborate on why that would be a workaround? Basically, after instruction selection, the COPY from SGPR to VGPR should be lowered to a native instruction.

Because this should be done after register allocation like is already done. Replacing a copy with something else should only interfere with generic optimizations

That would be too late for MachineCSE and other optimization to remove the redundant COPYs and reduce the register usage. Moving that after RA won't reduce register pressure.

So you are working around something MahcineCSE isn't doing on copies. You could just have MachineCSE do this for these copies

MachineCSE and other optimizations are designed not to handle that target-independent COPY or like, which is added with the intention that, potentially, the source and destination operands are coalesced and that COPY is removed finally. As SGPR and VGPR are different register banks and won't be coalesced anyway, native instruction should be used instead.

This does not make sense, COPY is what generic optimizations do understand. If it's useful to CSE cross bank copies, MachineCSE should handle them. As a representational choice, not-copy is worse than copy

A target-independent cross-bank COPY is definitely useful but the current COPY should not be used for that purpose considering how it's used in RA-related passes, especially only architectural constraints are changed between the source and destination operands where the propagation should be stopped.

This is exactly what copy is for. It wouldn't make sense to have a separate copy for this

They serve different purposes. Cross-reg-bank COPY is purely for value propagation as it must be materialized. But, coalescable COPY (on compatible register classes) needs to satisfy architectural constraints on different instructions and need to find a better register assignment to remove it if possible. The former needs eliminating redundant ones but the later needs skipping by most optimizations without comprehensive target-specific info. They need to be treated very differently.

They serve different purposes. Cross-reg-bank COPY is purely for value propagation as it must be materialized. But, coalescable COPY (on compatible register classes) needs to satisfy architectural constraints on different instructions and need to find a better register assignment to remove it if possible. The former needs eliminating redundant ones but the later needs skipping by most optimizations without comprehensive target-specific info. They need to be treated very differently.

Maybe treated differently by some optimizations, but not replaced with moves