This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add scheduler pass to rematerialize trivial defs
ClosedPublic

Authored by vangthao on Feb 10 2022, 12:57 PM.

Details

Summary

Add a new pass in the pre-ra AMDGPU scheduler to check if sinking trivially rematerializable defs that only has one use outside of the defining block will increase occupancy. If we can determine that occupancy can be increased, then rematerialize only the minimum amount of defs required to increase occupancy. Also re-schedule all regions that had occupancy matching the previous min occupancy using the new occupancy.

This is based off of the discussion in https://reviews.llvm.org/D117562.

The logic to determine the defs we should collect and determining if sinking would be beneficial is mostly the same. Main differences is that we are no longer limiting it to immediate defs and the def and use does not have to be part of a loop.

Diff Detail

Event Timeline

vangthao created this revision.Feb 10 2022, 12:57 PM
vangthao requested review of this revision.Feb 10 2022, 12:57 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 10 2022, 12:57 PM

One problem is that you are trying to handle multiple regions. You cannot estimate if an improvement in one region will actually improve an occupancy for the whole kernel, while actually sinking instructions. That's why I suggested at least to start with the scenario when you have only one region limiting the occupancy. I also hope it will simplify the logic in the sinkTriviallyRematInsts a lot, as I have to admit I am completely lost in it.

llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
723

Actually it does not matter if that is a VGPR. AGPRs also affect occupancy and SGPRs before gfx10 too.

733

If there is a subreg def you can bail immediately, it is not rematerializable anyway. If there is more than one def too for that matter.

752

All of that subreg machinery isn't really needed, we cannot rematerizlize subreg defs.

778

You could immediately bail if use is not in a region with high pressure, yet in the beginning of this function. No need to even add such instruction to the list.

797

You do not know it yet until you have actually rematerialized anything into this region.

809

This logic is incomplete. RematerializableInsts has to be filtered to only instructions relevant for the current region.

939

Actually you cannot just rematerialize it without checking all uses are available. Check LiveRangeEdit::allUsesAvailableAt and LiveRangeEdit::canRematerializeAt, this is the checks you need to perform before adding an instruction to the list.

963

You need to update region begin interator somewhere. Your could rematerialize an instruction into the begin of the region.

972

Use TII->isReallyTriviallyReMaterializable() instead.

llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
115

Isn't it the same as RegionsWithHighRP?

rampitec added inline comments.Feb 10 2022, 4:39 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
939

You could since you are refusing instructions with any uses.

972

Actually I see that you are refusing any uses, so you do not need to do the same checks as LRE before rematerialization. That's fine, but needs a comment. Disregard a comment about isReallyTriviallyReMaterializable, isTriviallyReMaterializable calls it.

vangthao updated this revision to Diff 408000.Feb 11 2022, 12:47 PM
vangthao marked 4 inline comments as done.

Reduce patch to only handle one region. Remove sinking of subreg defs. Only collect trivially rematerializable defs in high RP region.

vangthao marked 2 inline comments as done.Feb 11 2022, 12:49 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
723

Also handling AGPRs and SGPRs may add complexity to this patch so I did not want to handle those instructions as well. This patch should bail if anything other than VGPR is causing occupancy to drop.

733

If it is not rematerializable, can we perform checks to see if it can be sink to its single use?

llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
115

RegionsWithHighRP is broken since it using LLVM's RPTracker to set HasExcessPressure and RPTracker does not capture live-throughs. I noticed that there can be a huge difference of register pressure during scheduling (since RPTracker is used to make decisions) and the RP after when using GCNRegPressure.

rampitec added inline comments.Feb 11 2022, 2:53 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
723

OK, but let's handle it later. Add a TODO comment.

733

It is the same as rematerinalization. Just skip all of them.

737

Also skip it is if the use is not in the desired region. Actually I do not see where do you check it at all.

797

Typo: occupance.

800

I do not think you really need a whole block above dealing with live-through estimations. First it is not guaranteed to help the RP if you sink a live-through. Second it is essentially the same as this block actually performing rematerialization, just less precise. Keep just rematerialization part.

It will currently fail to update region begin pointer if insertion point is the first instruction in its region.

llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
640

You can also quickly skip it if there is only one region (or block).

I don't understand why the scheduler should be doing rematerialization. The rematerialization should have been done by the coalescer if it would reduce pressure

I don't understand why the scheduler should be doing rematerialization. The rematerialization should have been done by the coalescer if it would reduce pressure

Not even a coalescer, it shall be done in RA, but RA only does it when it has to spill and knows nothing about occupancy. Scheduler is a closest place where we have all RP information and can reasonably intervene. In a long run we need to consider teaching RA about occupancy.

vangthao updated this revision to Diff 408553.Feb 14 2022, 12:20 PM

Update RegionBegin for blocks we rematerialize defs to. Skip rematerializing if only one block to schedule.

vangthao marked 6 inline comments as done.Feb 14 2022, 12:27 PM
vangthao added inline comments.
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
737
if (HighRPLiveIns.find(Reg) == HighRPLiveIns.end())
  continue;

Check is perform above to see if reg is live-in or not. If it is then it must be used in this block or is a live-through.

800

If the live-through is increasing RP in this block then by rematerializing don't we decrease RP by the live-throughs that we rematerialized?

rampitec added inline comments.Feb 14 2022, 12:33 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
737

Thanks. Add a comment for this check, it is not obvious.

800

I do not think LICM will hoist anything like that. Then even if so the code below will handle it as well.

vangthao added inline comments.Feb 14 2022, 12:37 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
800

The code below only checks for trivially rematerializable defs used in the high RP block. If the def is live-through and used in a low RP block then it will not be checked. This checks for that and sinks those defs to lower RP blocks so we can decrease live-through RP.

rampitec added inline comments.Feb 14 2022, 12:47 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
800

I see 2 situations where something sinkable can be live-through:

Def
loop:
  ...
cbr loop
Use

There is absolutely no reason for any pass to hoist that Def high, most likely it will be sunk much earlier.

Def
loop1:
  loop2:
    Use
  cbr loop2
cbr loop1

Assuming you have a high live-through pressure in loop1 sinking Def to Use into loop2 will unlikely help it as the highest pressure is likely in the loop2 anyway.

vangthao added inline comments.Feb 14 2022, 12:58 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
800

What about this situation? The def is live throughout the whole loop since it is needed in each iteration.

Def
loop1:
  ...
  ... (high rp block)
  ...
  use (lower rp block)
  cbr loop1
rampitec added inline comments.Feb 14 2022, 1:37 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
800

This probably can happen, I doubt it is a common situation though. But then I still do not understand why code below will not handle it? You are only collecting instructions where defined register is in the live-in pressure of the high RP block, so it will be collected. Then below you are sinking it to the use. I.e. to me it will do the same thing automatically.

vangthao added inline comments.Feb 14 2022, 1:53 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
800

We can handle both situations there but re-calculating RP is expensive so I split up the checks into two parts where we first check live-throughs then check defs that have a use in the high RP block.

For the first live-through check, it is simple to re-calculate RP since we can deduct RP from the cached pressure while the second check needs a RPTracker to go over the whole region each time we sink a def. If we can improve occupancy by only sinking live-throughs then that will reduce the amount of time we have to use RPTracker.

rampitec added inline comments.Feb 14 2022, 2:23 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
800

Besides unnecessary code complication I think you will need to recalculate RP anyway. You are going to reschedule the block to which you are sinking the instruction and since you may have several other blocks on the way this recalculation will be needed.

Speaking of which... It looks like if you have blocks in between this code will leave cached LiveIns, MBBLiveIns and Pressure for these blocks in an incorrect state.

I really like to keep it simple, at least until there is a proof a complication is necessary. Then even if it is needed it is easier to implement, debug and review it in a small steps.

Actually you also need to update RegionBegin and RegionEnd for the block where you are erasing def in case if def was the first or the last instruction (do you have a test like this?)

vangthao updated this revision to Diff 409055.Feb 15 2022, 2:15 PM

Combine logic for live-throughs and uses in block. Update RegionBegin for sink from and to region. Increment RegionIdx when we skip a region due to it being size 0 or 1 else next iteration will still point to this region.

Actually you also need to update RegionBegin and RegionEnd for the block where you are erasing def in case if def was the first or the last instruction (do you have a test like this?)

Will add a test.

llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
800

If this is the last pass, do we still want to fix the cached values for those blocks we are not re-scheduling?

rampitec added inline comments.Feb 15 2022, 3:10 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
800

What will happen if we add a new pass after? But updating all blocks in between can be tricky, you will need a dominator tree for that. Add a static_assert(LastStage == PreRARematerialize) and comment the reason.

822

No else after return.

830

You actually need to update RegionEnd because there can be a fallthrough without a terminator.

rampitec added inline comments.Feb 15 2022, 3:26 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
830

And actually remember that region is not the same as basic block. It may start and end in the middle of a block.

rampitec added inline comments.Feb 15 2022, 3:35 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
872

You also need to either update MBBLiveIns and BBLiveInMap cache or erase this block from the map. Presumably it will not be needed anymore, so you can simply erase it.

vangthao updated this revision to Diff 409362.Feb 16 2022, 12:16 PM
vangthao marked 12 inline comments as done.

Added static_assert requring that PreRARematerialize is the last pass. Check RegionEnd for region we are sinking from. Added two tests to check for RegionIdx being incremented correctly when a region is reduced to size 0 or 1. Erase high rp region from MBBLiveIns after sinking since it is now invalid.

vangthao added inline comments.Feb 16 2022, 12:21 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
830

Can you give an example of a fallthrough case? I thought regions were decided by isSchedBoundary().

872

BBLiveInMap seems to require the original first non-dbg instruction before scheduling as the key.

rampitec added inline comments.Feb 16 2022, 12:52 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
654

"are not supported".

830

A basic block may finish without a terminator. It will just fall through into the next block.

841

No else after break.

845

No else after break.

872

Just record it and erase.

vangthao updated this revision to Diff 409426.Feb 16 2022, 3:04 PM

Address review comments.

rampitec added inline comments.Feb 16 2022, 3:21 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
888

I mean record it before you have done changes and then erase here.

vangthao marked 4 inline comments as done.Feb 16 2022, 3:33 PM
vangthao added inline comments.
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
888

I believe the keys and BB live-ins are recorded before any passes are ran on the regions. This means that even the initial scheduling pass may have changed the first non-debug instruction causing us to be unable to easily obtain the key for the mapping for this MBB. Do you mean to record it before we run any scheduling passes at all?

rampitec added inline comments.Feb 16 2022, 3:36 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
888

I mean get it before you have erased instruction. Potentially the instruction it points to.

Good point about potential invalidation of the key due to scheduling, as a separate patch we need to fix these keys. Not a difficult fix.

vangthao updated this revision to Diff 409443.Feb 16 2022, 4:00 PM

If the def we are sinking maps to a live-in set in BBLiveInMap then erase it since it is now sinked to another MBB and the entry is currently invalid.

llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
888

Ah thanks! I think I understand now. Updated in latest diff.

rampitec added inline comments.Feb 17 2022, 11:20 AM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
888

I have double checked. It does not matter that BBLiveInMap have keys which may shift during scheduling because it is only used once. This code is still good though because it is generally a bad idea to leave a dangling pointer.

rampitec added inline comments.Feb 17 2022, 11:25 AM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
796

When this patch is landed we can try to use LRE::handleMove() here to speed up it.

830

Look at the dbg-value-ends-sched-region.mir for example. The first block is ended with a V_MOV_B32:

  %10:vreg_64 = IMPLICIT_DEF
  undef %11.sub0:vreg_64 = V_MOV_B32_e32 0, implicit $exec

bb.1:
  %12:vreg_64 = COPY %2

So if you move this instruction you will invalidate RegionEnd.

vangthao added inline comments.Feb 17 2022, 11:39 AM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
830

I believe RegionEnd would be equivalent to MBB.end() which is a pass the end iterator and would not point to the V_MOV_B32 in this case, or am I wrong?

rampitec added inline comments.Feb 17 2022, 12:01 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
830

OK, makes sense. None of isSchedulingBoundary() instructions are rematerializable too.

The code looks good to me. I yet have to check the tests.
In the meantime @vpykhtin and @kerbowa could you please take a look as well?

llvm/test/CodeGen/AMDGPU/debug-value-scheduler-crash.mir
25–81

Please pre-commit test changes (to use CHECK-NEXT) and rebase.

llvm/test/CodeGen/AMDGPU/sched-assert-dead-def-subreg-use-other-subreg.mir
21–46

Please pre-commit test changes (to use CHECK-NEXT) and rebase.

rampitec added inline comments.Feb 17 2022, 12:35 PM
llvm/test/CodeGen/AMDGPU/machine-scheduler-sink-trivial-remats.mir
2654

Do you have a similar test with sub1 and sub0 defined which would be sinkable if not subreg use?

vangthao updated this revision to Diff 410402.Feb 21 2022, 3:15 PM

Rebase on top of pre-committed tests. Add subreg test.

vangthao marked 2 inline comments as done.Feb 21 2022, 3:17 PM
vangthao added inline comments.
llvm/test/CodeGen/AMDGPU/machine-scheduler-sink-trivial-remats.mir
2654

Added test: test_no_sink_two_subregs_in_def_block.

kerbowa added inline comments.Feb 21 2022, 5:51 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
640

If we only try this with a single region can RegionsWithMinOcc be an Optional<int> or something? Seems like it would not be necessary to have a vector tracking all the regions that fit the criteria.

646

Is this needed? If we are limited by the function max occupancy including LDS and wave-per-eu constraints then multiple regions will be at the MinOccupancy.

652

Why is the PreRARematerialize pass different and doesn't need this info? Do we only reschedule the single region?

730

Can you add a comment as Stas suggests?

777

Can these be combined to avoid the extra lookup?

789

TYPO: does->do

811

Is there a test for this case?

872

I guess you could have a function to do this updating of RegionBegin/RegionEnd since it is repeated?

llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
81

I feel like we need to refactor the way we implement scheduler passes now that there are so many of them and the logic is getting a bit complex. It's a separate thing though.

123

Can you update the comment so that it is clear these are pairs of defs/uses.

138

Why only VGPRs? Could this be extended to handle SGPRs and AGPRs.

vangthao updated this revision to Diff 410639.Feb 22 2022, 1:41 PM

Moved update of region boundaries to its own function. Added test case for occupancy not improved. Removed redundant vector containing NewMIs. Added comments.

vangthao marked 10 inline comments as done.Feb 22 2022, 1:46 PM
vangthao added inline comments.
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
640

Originally I wanted to handle multiple regions, but changed it to a single region to reduce complexity. In a later patch, hopefully we can handle multiple regions.

652

Yes, we only re-schedule the single high RP region. If we sink to a lower RP region, it should not have any impact on RP for that region since we are only sinking trivially rematerializable defs with no vreg uses.

llvm/lib/Target/AMDGPU/GCNSchedStrategy.h
138

Yes, this could also be extended to handle SGPRs and AGPRs. I can put a TODO here.

vpykhtin added inline comments.Feb 23 2022, 4:51 AM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
719

const GCNRPTracker::LiveRegSet &HighRPLiveIns

769
  1. BestRegDiff is a misleading name, this is actually the number of VGPRs would left after sinking in the best case.
  2. This implies that RematerializableInsts.size() is the number of 32bit VGPRs freed after sinking? I'm not sure if there are any 64bit rematerializable instructions, but what if they're eventually added?
770

Is there any way that VGPRUsage is less than RematerializableInsts.size()?

799

Just Reg instead of NewMI->getOperand(0).getReg()?

800

ditto

857

just MBBLiveIns.erase(Regions[HighRPIdx].first->getParent()) - nothing will happen if it isn't in the map

903

Should this skip debug instructions? (and other places in this function too)

914

It would be easier to understand if you search the first region for the BB and then do a loop until regions for other BBs will come.

rampitec added inline comments.Feb 23 2022, 3:32 PM
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
769

Actually an example of such instruction is V_CVT_F64_I32.

vangthao updated this revision to Diff 411281.Feb 24 2022, 6:02 PM

Check register size when calculating if we have enough instructions to sink to improve occupancy. Added tests for 64bit defs sinking and address other review comments.

vangthao marked 12 inline comments as done.Feb 24 2022, 8:51 PM
vangthao added inline comments.
llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
770

It should not be possible if we filter instructions relevant only to this block.

903

We shouldn't skip debug instructions here because they can be selected as a RegionBegin of a region.

Herald added a project: Restricted Project. · View Herald TranscriptMar 3 2022, 8:36 AM

Does it pass PSDB?

llvm/lib/Target/AMDGPU/GCNSchedStrategy.cpp
425

Just = PressureAfter.getOccupancy(ST) == MinOccupancy;

436

ditto.

vangthao updated this revision to Diff 413029.Mar 4 2022, 8:55 AM

Address review comments.

vangthao marked 2 inline comments as done.Mar 4 2022, 8:56 AM

PSDB passes.

rampitec accepted this revision.Mar 4 2022, 11:19 AM

LGTM, but please wait until Monday for others to comment.

This revision is now accepted and ready to land.Mar 4 2022, 11:19 AM
This revision was landed with ongoing or failed builds.Mar 9 2022, 9:35 AM
This revision was automatically updated to reflect the committed changes.