Page MenuHomePhabricator

[AMDGPU] Optimizing unnecessary copies for REG_SEQUENCE PHI operand. Also fixes rocBLAS error
Needs ReviewPublic

Authored by alex-t on Dec 5 2019, 1:21 PM.

Details

Summary

This change optimizes out unnecessary copies that appear in the case when the PHI node input register is defined by the REG_SEQUENCE.
The REG_SEQUENCE not necessarily is located in the PHI's predecessor block.

rocBLAS error is really caused by the subregisters tracking bug. PostRA Machine Sink cannot discover that sreg_32 writes are defining superreg that is used as source operand in subsequent COPY. Machine Sink then sinks the superreg COPY to the "THEN" block of the branch but the copies of sreg32 that deine superreg COPY source operand - to the "ELSE" block. As a result "THEN" block receives uninitialized register.
This issue is going to be addressed in separate change later on.

The optimization in current change removes the IR pattern that triggers the Machine Sink bug.
Since it need to be done anyway to remove unnecessary copies, it goes first.

Diff Detail

Event Timeline

alex-t created this revision.Dec 5 2019, 1:21 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 5 2019, 1:21 PM

Looks good, but the test would be nice to have.

arsenm added a comment.Dec 6 2019, 9:27 AM

Definitely needs a test. I'm not sure I understand the fix here. This is some kind of workaround? I would be happier with the subreg fix happening first. Removing unnecessary copies should not be relevant to the correctness fix?

llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
275

Factoring this out is a separate patch

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
4274

I think the legalization done here should only touch the local instruction, and not be scanning through copy chains. This seems like it's trying to avoid a failure of the PeepholeOptimizer?

4290–4294

Needs comments

4303

Doesn't need to be a reference

alex-t marked an inline comment as done.EditedDec 6 2019, 10:25 AM

Definitely needs a test. I'm not sure I understand the fix here. This is some kind of workaround? I would be happier with the subreg fix happening first. Removing unnecessary copies should not be relevant to the correctness fix?

Agree. The reason to try submitting this first is that change in lib/CodeGen/MachineSink.cpp that fixes the issue will likely take a long time to be approved.
This issue with this https://reviews.llvm.org/D68635 patch breaks one of the rocBLAS kernels correctness because of sinking subreg definition out from the superreg use.
This in order creates several duplicated P1 bugs.

Removing unnecessary copies just covers the IR pattern that drives PostRA Machine Sink mad.

If I could to speedup the PostRA Machine Sink fix review somehow, I prefer to submit it first and work on the copy elimination patch refactoring to make it more elegant.

The PostRA Machine Sink fix is here: https://reviews.llvm.org/D71132

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
4274

Initially I had intention to generalize your code that handles single COPY user of REG_SEQUENCE (SIFixSGPRCopies.cpp). Then I realized that the pattern I'm chasing to eliminate is only related to the PHIs.
Namely, it is the single-use chain of copies that connects REG_SEQUENCE that produce the value to the PHI that consumes it. The chain is possibly going through the several BBs.
That's why I opted for the backward search from the PHI up to the REG_SEQUENCE .
I changed the walk in the runOnMachineFunction main loop to post order to ensure that PHI is processed first, before the REG_SEQUENCE.
I agree that generalizing foldVGPRCopyIntoRegSequence would be more neat.