This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] SIFoldOperands should not fold register acrocc the EXEC definition
ClosedPublic

Authored by alex-t on Sep 17 2019, 8:28 AM.

Details

Summary

Register that is defined by the copy that implicitly uses EXEC should not be folded in case there exist EXEC definition between register definition and instruction to which it is going to be folded. Otherwise, scalar values may be exposed outside the divergent loop w/o copying to VGPR.

For instance:

r1 = copy r0 imp use exec
exec = def_exec

some_inst use(r1)

we cannot fold r0 to some_inst

Diff Detail

Event Timeline

alex-t created this revision.Sep 17 2019, 8:28 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 17 2019, 8:28 AM
rampitec added inline comments.Sep 17 2019, 10:42 AM
llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
422

We have execMayBeModifiedBeforeUse(), although it does not do cross block scan. I.e. you need to extend execMayBeModifiedBeforeUse() instead of creating another function with the same intent.

423

Formatting is off.

428

Even if it can fall through exec can be modified.

429

What if there are several blocks and branches in between of def and use?

433

Just Def->readsRegister(AMDGPU::EXEC, TRI).

434

modifiesRegister(ANDGPU::EXEC, TRI). Otherwise it will not work in wave32.

alex-t marked an inline comment as done.Sep 19 2019, 12:41 AM
alex-t added inline comments.
llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
429

In fact, the only case we should care of is the definition inside the loop and use outside.
In case loop has the divergent exit we have temporal dependency.
All other cases may be broken into the following:

  1. Definition dominates the divergent branch but use does not post-dominate the definition. This means that the use is in one of the branch's target blocks: all threads that have taken this block observe same value. This is true irrespective of how many divergent branches are in between.
  2. Definition dominates the divergent branch and use post-dominates the definition. This means that the exec mask is the same for use and definition irrespective of how many times it was modified in between.

So, I'd check for the exact condition: the copy uses exec and is inside the loop and use is outside the loop and loop has divergent exit and there is a path from this exit to the use block.

alex-t updated this revision to Diff 221368.Sep 23 2019, 11:14 AM

LoopFinder class was employed to detect if the register copy source cannot be folded.
The idea is to use uniform method to detect the exec mask merge necessity and the COPY movement/folding restrictions.
LoopFinder was refactored out of the SILowerI1Copies to AMDGPU/Utils for this purpose.
addLoopEntries method was refactored from the LoopFinder to SILowerI1Copies
because it changes the Machine Function and naturally belongs to the pass that is the ancestor of MachineFunctionPass.
LoopFinder is the analysis that should not change the MachineFunction.

Unfortunately LoopFinder does not store the detected loop's exits. So we cannot check if they modify EXEC.
So we'd opt for over-conservative but correct approach.

LoopFinder class was employed to detect if the register copy source cannot be folded.
The idea is to use uniform method to detect the exec mask merge necessity and the COPY movement/folding restrictions.
LoopFinder was refactored out of the SILowerI1Copies to AMDGPU/Utils for this purpose.
addLoopEntries method was refactored from the LoopFinder to SILowerI1Copies
because it changes the Machine Function and naturally belongs to the pass that is the ancestor of MachineFunctionPass.
LoopFinder is the analysis that should not change the MachineFunction.

Unfortunately LoopFinder does not store the detected loop's exits. So we cannot check if they modify EXEC.
So we'd opt for over-conservative but correct approach.

Building a PDT for every invocation of the SIFoldOperands seems a too heavy hummer for a too small problem.
I would rather just call execMayBeModifiedBeforeUse() and bail.

alex-t updated this revision to Diff 221732.Sep 25 2019, 5:57 AM

lightweight solution.

You need a test showing the problem you are solving.

alex-t updated this revision to Diff 222192.Sep 27 2019, 10:17 AM

test added

This revision is now accepted and ready to land.Sep 27 2019, 10:23 AM
alex-t closed this revision.Oct 1 2019, 7:22 AM