This is an archive of the discontinued LLVM Phabricator instance.

SILoadStoreOptimizer pass schedules s_add,s_addc with interfering s_lshl
AbandonedPublic

Authored by ronlieb on Apr 9 2019, 6:15 AM.

Details

Reviewers
rampitec
arsenm
Summary
A 64 bit add implimented with two 32 bit add instructions uses the
$SCC register for carry.  The S_LSHL_B32 instruction write the $SCC reg
Pass was incorrectly transforming
  S_LSHL_B32
  S_ADD_U32
  S_ADDC_U32
into
  S_ADD_U32
  S_LSHL_B32
  S_ADDC_U32

SILoadStoreOptimizer constructs a list of instructions to move and did
not take into account the dependent S_ADD_u32 instruction when it chose
to move the S_ADDC_U32 instruction.

Diff Detail

Event Timeline

ronlieb created this revision.Apr 9 2019, 6:15 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 9 2019, 6:15 AM
arsenm requested changes to this revision.Apr 9 2019, 6:42 AM

You should use computeRegisterLiveness instead of adding a more naive search for a def, which may not exist.

The testcase can also be a lot simpler (and should probable be MIR)

test/CodeGen/AMDGPU/scc-add-lshl-addc.ll
80

I doubt you need any of this metadata

This revision now requires changes to proceed.Apr 9 2019, 6:42 AM
ronlieb marked an inline comment as done.Apr 9 2019, 7:05 AM

test will convert to MIR form.
Patch will change to use computeRegisterLiveness. i will have to use a pretty large neighborhood , as the original code this error occurred in (before running bugpoint) , had the s_add_u32 instruction was separated by over 400 instructions from the s_addc_u32 instruction. We will assert fail if we cannot find the s_add_u32 instruction, so that will alert us to increase neighborhood size. This patch will also handle the corresponding sub instructions.

test/CodeGen/AMDGPU/scc-add-lshl-addc.ll
80

it should vanish once i convert test to MIR form

correction: the original input test did NOT have the instructions separated by more than 1 or 2 instructions The resultant output showed the large separation.
The default neighborhood of 10 is probably more than enough.

after looking at the suggestion of using computeRegisterLiveness, I noticed that it does not return the MI where the register in question is most recently defined.
Rather, it informs on liveness within a range. I don’t really see how I would use this method effectively?
The problem I am trying to solve requires identifying a specific instruction that is needed by a subsequent instruction and then adding the identified instruction to a list constructed by SILoadStoreOptimizer.

Regarding the lit test, i cleaned it up quite a bit and will post a new version shortly.
I will create a .MIR test from this to specifically test the SILoadStoreOptimizer pass in a direct fashion.
Any objection to keeping the the .ll test after the MIR test is added?

ronlieb updated this revision to Diff 194355.Apr 9 2019, 9:38 AM

correction: the original input test did NOT have the instructions separated by more than 1 or 2 instructions The resultant output showed the large separation.
The default neighborhood of 10 is probably more than enough.

This is correction issue, not performance. Any kind of threshold is an error. If you did not find the def you probably need to bail.

lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
284

It can be defined in another block. It can be also undef.

ronlieb marked an inline comment as done.Apr 9 2019, 1:25 PM
ronlieb added inline comments.
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
284

Splitting a pair of instructions across basic block boundaries in this situation seems really crazy. These instruction pairs are implementing a 64 bit add or 64 bit subtract. I understand that generally speaking we could see both situations (split or under). If this were to occur in this pass, i would want to assert (which is what this patch will do) so we can go look into it, rather than having broken code generated. To split them would mean that $SCC is live in to the block.

rampitec added inline comments.Apr 9 2019, 1:33 PM
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
284

Why not? What if half of that pair was hoisted out of the block into parent?

i agree it could happen. Not sure what to do about it here.

The current problem i am trying to resolve in somewhat analogous to hoisting 1/2 of the 64 bit add instruction pair. Although in this particular situation we are actually sinking 1/2 of the instruction pair into a later position within the same block. And yes, i can see how in the future a new machine instruction pass might choose to hoist one of the instructions into a pred BB. I realize i can write additional code to scan a previous block. However i think its better that passes not hoist part of an instruction pair, especially ones such as these. To that end i would rather see my patch assert so that we are forced to deal with such a situation should it arise.
Your example, btw, is a good one for why we should have an IR test for the current problem, rather than an MIR test. An MIR test that runs just before SILoadStoreOptimizer will not detect the affects of a new pass. Whereas the IR test attached to this patch stands a better chance of detecting the issue.

The current problem i am trying to resolve in somewhat analogous to hoisting 1/2 of the 64 bit add instruction pair. Although in this particular situation we are actually sinking 1/2 of the instruction pair into a later position within the same block. And yes, i can see how in the future a new machine instruction pass might choose to hoist one of the instructions into a pred BB. I realize i can write additional code to scan a previous block. However i think its better that passes not hoist part of an instruction pair, especially ones such as these. To that end i would rather see my patch assert so that we are forced to deal with such a situation should it arise.
Your example, btw, is a good one for why we should have an IR test for the current problem, rather than an MIR test. An MIR test that runs just before SILoadStoreOptimizer will not detect the affects of a new pass. Whereas the IR test attached to this patch stands a better chance of detecting the issue.

Why not just bail the optimization if you didn't find a def reasonable close?

i think bailing the optimization if not found within some reasonable distance (10 seems to be popular), is a good suggestion. Much better than aborting. thx

ronlieb updated this revision to Diff 194420.Apr 9 2019, 4:57 PM

Added check for instr match missing, and bail on optimization if so.
I prefer the .ll test we have for the patch now over that of creating an MIR test for this issue.

rampitec added inline comments.Apr 9 2019, 5:12 PM
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
333

Does that really mean to bail? Check the uses.
You also need a test where you did not find the pair, a mir test.

ronlieb marked an inline comment as done.Apr 9 2019, 6:01 PM
ronlieb added inline comments.
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
333

i see what you mean about the bail and uses , thx.
good suggestion for the mir test. will add

ronlieb updated this revision to Diff 194514.Apr 10 2019, 8:00 AM

added two MIR tests,and refined logic to properly bail.

rampitec added inline comments.Apr 10 2019, 9:40 AM
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
279

Having i and I variables in the same statement is quite misleading.

test/CodeGen/AMDGPU/scc-missing-add.mir
2

You can combine both mir tests into a single file and significantly reduce them. For example you do not need all of the IR.

ronlieb marked an inline comment as done.Apr 11 2019, 12:50 PM
ronlieb added inline comments.
test/CodeGen/AMDGPU/scc-missing-add.mir
2

after playing with trying to reduce the IR, I don't really think I can.
This particular pass seems sensitive to PC relative references within the MIR that are defined within the IR, and other symbol references as well.

It sort of falls into this category as described in the MIR documentation:

MIR code contains a whole IR module. This is necessary because there are no equivalents in MIR for global variables, references to external functions, function attributes, metadata, debug info. Instead some MIR data references the IR constructs. You can often remove them if the test doesn’t depend on them.

And the above really complicates trying to merge the two tests into one.

arsenm added inline comments.Apr 11 2019, 12:53 PM
test/CodeGen/AMDGPU/scc-has-add.mir
10–12 ↗(On Diff #194514)

You can drop the block names, and IR references in the MMOs to drop the IR section

113–115 ↗(On Diff #194514)

You can strip out a lot of instructions too. Usually I just create a smaller, totally artificial test case from scratch

ronlieb marked 2 inline comments as done.Apr 12 2019, 2:37 PM
ronlieb added inline comments.
test/CodeGen/AMDGPU/scc-has-add.mir
10–12 ↗(On Diff #194514)

The SILoadStoreOptimizer pass depends on Alias analysis on memory references as part of its decision to collect a group of instructions. So when i attempt to removes this information from the MIR, the test does not reproduce the issue. Similiarly, trying to remove the IR section also introduces issues with reproducing the issue, as their are PC relative definitions needed in the MIR test. So i think i need to keep the MIR test as is with a bit of cleanup. Also, since the scc-has-add.mir .test closely replicates the scc-add-lshl-addc.ll test, i plan to keep the .ll test, dump the scc-has-add.mir test, and keep the scc-missing-add,mir test.

ronlieb updated this revision to Diff 194964.Apr 12 2019, 2:37 PM
ronlieb updated this revision to Diff 195595.Apr 17 2019, 10:19 AM

slightly generalized to some physical reg. only look at previous instruction.
The definition is either there, and were all good, or we will bail.

ronlieb updated this revision to Diff 195750.Apr 18 2019, 8:25 AM

Added use of LivePhysRegs, happily lifted some code Krzy wrote for Hexagon to compute getLiveRegsAt.

ronlieb retitled this revision from SILoadStoreOptimizer pass mischedules s_add,s_addc with interfering s_lshl to SILoadStoreOptimizer pass schedules s_add,s_addc with interfering s_lshl.Apr 18 2019, 8:30 AM
arsenm added inline comments.Apr 29 2019, 2:30 AM
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp
197

Capitalize

201–202

Indentation

278–279

This is going to do a ~full scan of the block for every analyzed instruction, so this ends up being O(N^2). I was thinking more a single LivePhysReg instance for the entire block visit, which is lazily moved to the current point as necessary

301

This may not be broad enough. It only covers full defs. Usually modifiesRegister is what you want

325

The idea with using LivePhysRegs is to stop using this custom PhysRegUses set

328

The implicitness doesn't matter

ronlieb abandoned this revision.May 17 2019, 11:45 AM
ronlieb marked 4 inline comments as done.

Superseded by D61313