This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Sink immediate VGPR defs if high RP
AbandonedPublic

Authored by vangthao on Jan 18 2022, 6:37 AM.

Details

Summary

MachineLICM hoists any immediate def because it is deemed trivially rematerializable, without checking RP, in the hope that RA will pull it back down if needed. However, this does not take occupancy into account, and RA will not pull these defs back down to increase occupancy even if hoisting them is what decreased occupancy in the first place. This patch therefore manually sinks simple immediate defs (one def with a single use) from the preheader back into the loop when occupancy is too low, in an attempt to lower RP and increase occupancy.
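
To make the intent concrete, here is a minimal sketch of the transform (this is not the patch code; isSimpleImmDef() and occupancyAfterSinking() are hypothetical helpers standing in for the patch's own checks):

  #include "llvm/ADT/STLExtras.h"
  #include "llvm/CodeGen/MachineLoopInfo.h"
  #include "llvm/CodeGen/MachineRegisterInfo.h"

  using namespace llvm;

  // Hypothetical helpers, not part of the patch.
  static bool isSimpleImmDef(const MachineInstr &MI);
  static unsigned occupancyAfterSinking(const MachineInstr &Def,
                                        const MachineInstr &Use);

  static void sinkImmDefsIfLowOccupancy(MachineLoop &Loop,
                                        MachineRegisterInfo &MRI,
                                        unsigned CurrentOccupancy) {
    MachineBasicBlock *Preheader = Loop.getLoopPreheader();
    if (!Preheader)
      return;
    for (MachineInstr &MI : make_early_inc_range(*Preheader)) {
      // Only consider a plain immediate def with exactly one (non-debug) use.
      if (!isSimpleImmDef(MI))
        continue;
      Register Reg = MI.getOperand(0).getReg();
      if (!Reg.isVirtual() || !MRI.hasOneNonDBGUse(Reg))
        continue;
      MachineInstr &UseMI = *MRI.use_instr_nodbg_begin(Reg);
      if (!Loop.contains(UseMI.getParent()))
        continue;
      // Sink only when moving the def next to its use would raise occupancy.
      if (occupancyAfterSinking(MI, UseMI) > CurrentOccupancy)
        MI.moveBefore(&UseMI); // the real patch rematerializes the def instead
    }
  }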

Diff Detail

Event Timeline

vangthao created this revision.Jan 18 2022, 6:37 AM
vangthao requested review of this revision.Jan 18 2022, 6:37 AM
Herald added a project: Restricted Project. Jan 18 2022, 6:37 AM
vangthao updated this revision to Diff 400831.Jan 18 2022, 6:43 AM

Run clang-format on patch.

foad added inline comments.Jan 18 2022, 8:22 AM
llvm/lib/Target/AMDGPU/GCNPreRAOptimizations.cpp
772

What's special about occupancy 1?

vangthao added inline comments.Jan 18 2022, 8:36 AM
llvm/lib/Target/AMDGPU/GCNPreRAOptimizations.cpp
772

There is nothing special about occupancy 1 in particular. I am only using 1 as the minimum because it fixes a regression case we are seeing in SWDEV-316487.

rampitec requested changes to this revision.Jan 18 2022, 1:02 PM

The code as written does not guarantee it will increase the occupancy if a def was sunk. Moreover it will sink defs even in regions with low RP. Essentially it may easily pessimise the code without improving anything. You really need to check that sinking is beneficial.

It is also irrelevant to check for occupancy 1. If you are doing that, your main criterion is whether sinking will increase the occupancy, as long as the current occupancy is not at the requested maximum.

Needs tests, plenty of tests.

llvm/lib/Target/AMDGPU/GCNPreRAOptimizations.cpp
116

No need to use std::map, you can use a map from llvm.

131

Is it ever possible to have a def w/o a def?

This revision now requires changes to proceed.Jan 18 2022, 1:02 PM
vangthao updated this revision to Diff 401816.Jan 20 2022, 4:37 PM

Check whether occupancy will be increased by sinking and only sink if it will be.
Sink only the minimum needed to improve occupancy and avoid unnecessary sinking in low-RP blocks.
Added more tests.

vangthao marked an inline comment as done.Jan 20 2022, 4:38 PM
vangthao added inline comments.
llvm/lib/Target/AMDGPU/GCNPreRAOptimizations.cpp
131

Some inline asm was giving me assertion issues.

vangthao updated this revision to Diff 402505.Jan 24 2022, 6:36 AM

Fix wrong order of args sent to isEarlierInstr(). Added a test case with multiple subreg defs in the use block.

more comments to follow

llvm/lib/Target/AMDGPU/GCNPreRAOptimizations.cpp
139

you can use a single ++SeenSubRegDefs[Op.getSubReg()] instead of all this "find and if/else". If there is no entry in the map, it will be inserted with value 0 and then incremented.
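
For reference, the suggestion amounts to something like this, assuming Op is the operand being visited in the surrounding loop and SeenSubRegDefs is an llvm::DenseMap:

  llvm::DenseMap<unsigned, unsigned> SeenSubRegDefs;
  // operator[] default-constructs a missing entry (0 for unsigned), so one
  // expression both inserts and increments the per-subreg count.
  ++SeenSubRegDefs[Op.getSubReg()];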

226

given that you've checked !MRI->hasOneUse(Reg) before you just need to take the first use, no need to iterate.
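
In code, the simplification would look roughly like this, given the one-use check already performed above:

  // With MRI->hasOneUse(Reg) guaranteed, take the single use directly instead
  // of iterating over the use list.
  MachineOperand &UseOp = *MRI->use_begin(Reg);
  MachineInstr &UseMI = *UseOp.getParent();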

241

How can VGPRImmUse be true if you expect MRI->hasOneUse(Reg)?

268

use IsVGPRDst instead of TRI->isVGPRClass(RC)

Adding @foad, we need more eyes for this huge patch.

In a nutshell this is rematerialization, which should be done by the RA. However, RA only does it when it needs to spill and does not take occupancy into account. Ideally it would be better to teach RA to do so via some sort of target callback, but I guess it will not be easy.

Looking at the patch, and given that we cannot do it in RA itself, I am starting to think that the scheduler is better suited for this job, maybe as an extra last step. It already has our RPTracker and has collected all that info, so I guess it should be less code. Plus, all of the sinking may not be needed if the scheduler is able to improve the occupancy.

llvm/lib/Target/AMDGPU/GCNPreRAOptimizations.cpp
222

You need to check this mov does not have extra implicit operands, e.g. SIInstrInfo::isReallyTriviallyReMaterializable().
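
Roughly what that guard looks like, assuming TII and DefMI are the instruction info and the candidate mov from the surrounding code; the public entry point is TargetInstrInfo::isTriviallyReMaterializable(), which dispatches to the SIInstrInfo override mentioned above:

  // Skip movs with extra implicit operands or other side effects that make
  // rematerialization (and therefore sinking) unsafe.
  if (!TII->isTriviallyReMaterializable(DefMI))
    continue;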

324

It is better to check it earlier, before adding to the list of sinkable registers.

327

TII.reMaterialize()?

754

Why delegate it to processReg() and not have all this code here? processReg() becomes really hard to read.

Looking at the patch, and given that we cannot do it in RA itself, I am starting to think that the scheduler is better suited for this job, maybe as an extra last step. It already has our RPTracker and has collected all that info, so I guess it should be less code. Plus, all of the sinking may not be needed if the scheduler is able to improve the occupancy.

I agree that it would be best to perform this sinking during or after scheduling to see if we really need to sink these instructions. Can this be implemented in a follow-up patch? I will need to spend more time looking into the implementation.

foad added a comment.Jan 26 2022, 2:51 AM

Adding @foad, we need more eyes for this huge patch.

There are lots of moving parts here. In general MachineLICM seems bad at tracking register pressure. (For AMD people see SC1-2887.) In particular this comment is pertinent:

// Besides removing computation from the loop, hoisting an instruction has
// these effects:
//
// - The value defined by the instruction becomes live across the entire
//   loop. This increases register pressure in the loop.

but as far as I can see, the code does nothing to account for the extra pressure throughout the whole loop.

One idea we have discussed before is to simply disable MachineLICM for any loop where the register pressure is already "too high" (which would have to be decided by some target hook).

As for RA not sinking defs back into the loop, I guess this is because RA does not know about occupancy? So here's a radical idea: after (pre-RA) scheduling, when we know what occupancy we want to achieve, why not make that occupancy a hard requirement, marking all other vgprs reserved? Then RA would have to stick to the budget, spilling if necessary. And yes spilling can be really bad, but just failing to meet your occupancy target can also be really bad.
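
A hedged sketch of that idea (not existing code), assuming TargetOccupancy is the wave count chosen after scheduling, ST is the GCNSubtarget, and Reserved is the bit vector built in SIRegisterInfo::getReservedRegs(), which already reserves registers above the per-function limits in a similar way:

  // Reserve every VGPR above the budget implied by the target occupancy, so RA
  // must either fit the budget or spill. A real implementation would use
  // reserveRegisterTuples() so that tuples overlapping these VGPRs are also
  // reserved.
  unsigned MaxVGPRs = ST.getMaxNumVGPRs(TargetOccupancy);
  for (unsigned I = MaxVGPRs, E = ST.getAddressableNumVGPRs(); I != E; ++I)
    Reserved.set(AMDGPU::VGPR_32RegClass.getRegister(I));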

As for RA not sinking defs back into the loop, I guess this is because RA does not know about occupancy? So here's a radical idea: after (pre-RA) scheduling, when we know what occupancy we want to achieve, why not make that occupancy a hard requirement, marking all other vgprs reserved? Then RA would have to stick to the budget, spilling if necessary. And yes spilling can be really bad, but just failing to meet your occupancy target can also be really bad.

I think we don't know the occupancy we want to achieve unless there is a user requirement. In general it's hard to say what is more beneficial: higher occupancy or fewer spills.

foad added a comment.Jan 26 2022, 4:05 AM

As for RA not sinking defs back into the loop, I guess this is because RA does not know about occupancy? So here's a radical idea: after (pre-RA) scheduling, when we know what occupancy we want to achieve, why not make that occupancy a hard requirement, marking all other vgprs reserved? Then RA would have to stick to the budget, spilling if necessary. And yes spilling can be really bad, but just failing to meet your occupancy target can also be really bad.

I think we don't know the occupancy we want to achieve unless there is a user requirement. In general it's hard to say what is more beneficial: higher occupancy or fewer spills.

Yes it is hard, but what alternative do we have? I think having the compiler make a decision about what occupancy it is trying to achieve is better than not making any decision at all. And the scheduler already makes that decision; it is just not enforced.

Yes it is hard, but what alternative do we have? I think having the compiler make a decision about what occupancy it is trying to achieve is better than not making any decision at all. And the scheduler already makes that decision; it is just not enforced.

I forgot how the scheduler currently works; I thought it just tries not to make the occupancy worse. I don't know how we can make an assumption about the occupancy to enforce. Given this task we could probably find how many sinks we can do and work out the achievable occupancy. This could include not only loop sinks but any constant-holding VGPRs throughout a function.

This feels way too specific. Immediate defs aren't really a special case.

llvm/lib/Target/AMDGPU/GCNPreRAOptimizations.cpp
53–54

Isn't this already tracked in the MFI?

268

No else after return

Yes it is hard, but what alternative do we have? I think having the compiler make a decision about what occupancy it is trying to achieve is better than not making any decision at all. And the scheduler already makes that decision; it is just not enforced.

I forgot how the scheduler currently works; I thought it just tries not to make the occupancy worse. I don't know how we can make an assumption about the occupancy to enforce. Given this task we could probably find how many sinks we can do and work out the achievable occupancy. This could include not only loop sinks but any constant-holding VGPRs throughout a function.

Right, by the time we hit the pre-RA scheduler, occupancy may already have been decreased by MachineLICM, and since the scheduler does not currently have any way to decrease the register pressure of live-throughs, it has no ability to increase occupancy if it was decreased by MachineLICM hoisting out of loops. Enforcing an occupancy target for RA will help in this case. Another issue is that I believe the scheduler will allow more VGPR usage as long as it does not decrease occupancy. This may hide some achievable occupancy if we do not handle this before the scheduler or make the scheduler aware that it can increase occupancy by sinking. I am not sure that disabling MachineLICM altogether in high-RP situations would be the best solution here.

This feels way too specific. Immediate defs aren't really a special case.

Immediate defs were the cause of an issue that I saw in a reported regression case, so that was the main focus here. I think we can expand this to target all trivially rematerializable instructions.

In D117562#3272690, @vangthao wrote:
Right, by the time we hit the pre-RA scheduler, occupancy may already have been decreased by MachineLICM, and since the scheduler does not currently have any way to decrease the register pressure of live-throughs, it has no ability to increase occupancy if it was decreased by MachineLICM hoisting out of loops. Enforcing an occupancy target for RA will help in this case. Another issue is that I believe the scheduler will allow more VGPR usage as long as it does not decrease occupancy. This may hide some achievable occupancy if we do not handle this before the scheduler or make the scheduler aware that it can increase occupancy by sinking. I am not sure that disabling MachineLICM altogether in high-RP situations would be the best solution here.

So, the idea is to enforce occupancy achieved after scheduling?

BTW, I saw another kind of VGPR usage which can be rematerialized: CSEd "Base + Offset" addresses can in some cases consume a lot of registers.

In D117562#3272690, @vangthao wrote:
Right, by the time we hit the pre-RA scheduler, occupancy may already have been decreased by MachineLICM, and since the scheduler does not currently have any way to decrease the register pressure of live-throughs, it has no ability to increase occupancy if it was decreased by MachineLICM hoisting out of loops. Enforcing an occupancy target for RA will not help in this case. Another issue is that I believe the scheduler will allow more VGPR usage as long as it does not decrease occupancy. This may hide some achievable occupancy if we do not handle this before the scheduler or make the scheduler aware that it can increase occupancy by sinking. I am not sure that disabling MachineLICM altogether in high-RP situations would be the best solution here.

So, the idea is to enforce occupancy achieved after scheduling?

BTW, I saw another kind of VGPR usage which can be rematerialized: CSEd "Base + Offset" addresses can in some cases consume a lot of registers.

Sorry, I made a typo. I meant that enforcing occupancy will not help if we do not increase occupancy before scheduling or make the scheduler aware that sinking can be done to increase occupancy.

Setting a hard occupancy limit after scheduling might be a good idea actually. At that point we know the approximate register pressure pretty well, so if there is spilling it should be minimal and can be worth the increased occupancy. In fact we may even want some small amount of spilling in return for occupancy; that often helps. All of that is subject to thresholds and heuristics, of course. I see two problems though:

  1. The scheduler will calculate RP with the instructions hoisted. It will then need to identify rematerializable instructions which can be sunk and estimate the profitability of sinking them so it can artificially increase the occupancy. This is not necessarily an easy task.
  2. For gfx90a there is a combined pressure of VGPRs and AGPRs because they are essentially aliases of the same registers, and then there is ACCUM_OFFSET to set the split. To actually reserve registers we would need to account for AGPRs separately and quite precisely. But even then it will be a problem if we are scheduling a standalone function, because at the end of the day there is a single ACCUM_OFFSET for the whole kernel. Right now we give RA a generous register budget and hope for the best. If we start reserving registers, we will be unable to do that.

I am currently thinking about something like this (a rough sketch follows below):

  • Collect trivially rematerializable instructions with a single use outside of the def block at the first scheduler pass.
  • Collect blocks with a high RP (AFAIR we are already doing that).
  • Add a new scheduler pass after the current passes.
  • Filter the collected list of instructions, only keeping those whose (only) uses are in the blocks limiting the RP and with high live-through pressure.
  • Count how many instructions we have. If there are enough to cover the register deficit, proceed with sinking the defs right to their uses.
  • Reschedule the region into which we have sunk the def. If it succeeds in increasing occupancy, keep the schedule. If not, revert it.

The caveat is keeping a revert list for multiple blocks, because even a single block will limit the occupancy for the whole kernel. I'd suggest only doing it when we have a single block limiting occupancy, at least initially.

Note that this method does not need Loop Info either.
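
Very roughly, the flow might look like this as an extra stage in the scheduler (all helper names are hypothetical; this is a sketch of the proposal, not working code):

  // Hypothetical extra scheduler stage implementing the sinking proposal.
  void runSinkRematStage(GCNScheduleDAGMILive &DAG) {
    // 1. Trivially rematerializable defs with a single use outside the def
    //    block, recorded during the first scheduling pass.
    SmallVector<MachineInstr *, 8> Candidates = collectRematCandidates(DAG);
    // 2./4. Keep only candidates whose single use sits in a region that limits
    //       occupancy and which are live through the high-RP regions.
    filterToOccupancyLimitingRegions(Candidates, DAG);
    // 5. Bail out unless sinking them can cover the register deficit.
    if (!coversRegisterDeficit(Candidates, DAG))
      return;
    // 6. Sink each def next to its use, reschedule the affected region, and
    //    keep the new schedule only if occupancy improved; otherwise revert.
    for (MachineInstr *Def : Candidates)
      sinkAndRescheduleOrRevert(*Def, DAG);
  }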

llvm/lib/Target/AMDGPU/GCNPreRAOptimizations.cpp