A 64 bit add implimented with two 32 bit add instructions uses the $SCC register for carry. The S_LSHL_B32 instruction write the $SCC reg Pass was incorrectly transforming S_LSHL_B32 S_ADD_U32 S_ADDC_U32 into S_ADD_U32 S_LSHL_B32 S_ADDC_U32 SILoadStoreOptimizer constructs a list of instructions to move and did not take into account the dependent S_ADD_u32 instruction when it chose to move the S_ADDC_U32 instruction.
Details
Diff Detail
Event Timeline
You should use computeRegisterLiveness instead of adding a more naive search for a def, which may not exist.
The testcase can also be a lot simpler (and should probable be MIR)
test/CodeGen/AMDGPU/scc-add-lshl-addc.ll | ||
---|---|---|
79 | I doubt you need any of this metadata |
test will convert to MIR form.
Patch will change to use computeRegisterLiveness. i will have to use a pretty large neighborhood , as the original code this error occurred in (before running bugpoint) , had the s_add_u32 instruction was separated by over 400 instructions from the s_addc_u32 instruction. We will assert fail if we cannot find the s_add_u32 instruction, so that will alert us to increase neighborhood size. This patch will also handle the corresponding sub instructions.
test/CodeGen/AMDGPU/scc-add-lshl-addc.ll | ||
---|---|---|
79 | it should vanish once i convert test to MIR form |
correction: the original input test did NOT have the instructions separated by more than 1 or 2 instructions The resultant output showed the large separation.
The default neighborhood of 10 is probably more than enough.
after looking at the suggestion of using computeRegisterLiveness, I noticed that it does not return the MI where the register in question is most recently defined.
Rather, it informs on liveness within a range. I dont really see how I would use this method effectively?
The problem I am trying to solve requires identifying a specific instruction that is needed by a subsequent instruction and then adding the identified instruction to a list constructed by SILoadStoreOptimizer.
Regarding the lit test, i cleaned it up quite a bit and will post a new version shortly.
I will create a .MIR test from this to specifically test the SILoadStoreOptimizer pass in a direct fashion.
Any objection to keeping the the .ll test after the MIR test is added?
This is correction issue, not performance. Any kind of threshold is an error. If you did not find the def you probably need to bail.
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp | ||
---|---|---|
276 | It can be defined in another block. It can be also undef. |
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp | ||
---|---|---|
276 | Splitting a pair of instructions across basic block boundaries in this situation seems really crazy. These instruction pairs are implementing a 64 bit add or 64 bit subtract. I understand that generally speaking we could see both situations (split or under). If this were to occur in this pass, i would want to assert (which is what this patch will do) so we can go look into it, rather than having broken code generated. To split them would mean that $SCC is live in to the block. |
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp | ||
---|---|---|
276 | Why not? What if half of that pair was hoisted out of the block into parent? |
The current problem i am trying to resolve in somewhat analogous to hoisting 1/2 of the 64 bit add instruction pair. Although in this particular situation we are actually sinking 1/2 of the instruction pair into a later position within the same block. And yes, i can see how in the future a new machine instruction pass might choose to hoist one of the instructions into a pred BB. I realize i can write additional code to scan a previous block. However i think its better that passes not hoist part of an instruction pair, especially ones such as these. To that end i would rather see my patch assert so that we are forced to deal with such a situation should it arise.
Your example, btw, is a good one for why we should have an IR test for the current problem, rather than an MIR test. An MIR test that runs just before SILoadStoreOptimizer will not detect the affects of a new pass. Whereas the IR test attached to this patch stands a better chance of detecting the issue.
i think bailing the optimization if not found within some reasonable distance (10 seems to be popular), is a good suggestion. Much better than aborting. thx
Added check for instr match missing, and bail on optimization if so.
I prefer the .ll test we have for the patch now over that of creating an MIR test for this issue.
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp | ||
---|---|---|
305 | Does that really mean to bail? Check the uses. |
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp | ||
---|---|---|
305 | i see what you mean about the bail and uses , thx. |
test/CodeGen/AMDGPU/scc-missing-add.mir | ||
---|---|---|
1 ↗ | (On Diff #194514) | after playing with trying to reduce the IR, I don't really think I can. It sort of falls into this category as described in the MIR documentation: MIR code contains a whole IR module. This is necessary because there are no equivalents in MIR for global variables, references to external functions, function attributes, metadata, debug info. Instead some MIR data references the IR constructs. You can often remove them if the test doesn’t depend on them. And the above really complicates trying to merge the two tests into one. |
test/CodeGen/AMDGPU/scc-has-add.mir | ||
---|---|---|
10–12 ↗ | (On Diff #194514) | The SILoadStoreOptimizer pass depends on Alias analysis on memory references as part of its decision to collect a group of instructions. So when i attempt to removes this information from the MIR, the test does not reproduce the issue. Similiarly, trying to remove the IR section also introduces issues with reproducing the issue, as their are PC relative definitions needed in the MIR test. So i think i need to keep the MIR test as is with a bit of cleanup. Also, since the scc-has-add.mir .test closely replicates the scc-add-lshl-addc.ll test, i plan to keep the .ll test, dump the scc-has-add.mir test, and keep the scc-missing-add,mir test. |
slightly generalized to some physical reg. only look at previous instruction.
The definition is either there, and were all good, or we will bail.
Added use of LivePhysRegs, happily lifted some code Krzy wrote for Hexagon to compute getLiveRegsAt.
lib/Target/AMDGPU/SILoadStoreOptimizer.cpp | ||
---|---|---|
196 | Capitalize | |
200–201 | Indentation | |
270–271 | This is going to do a ~full scan of the block for every analyzed instruction, so this ends up being O(N^2). I was thinking more a single LivePhysReg instance for the entire block visit, which is lazily moved to the current point as necessary | |
293 | This may not be broad enough. It only covers full defs. Usually modifiesRegister is what you want | |
297–301 | The idea with using LivePhysRegs is to stop using this custom PhysRegUses set | |
300 | The implicitness doesn't matter |
Capitalize