This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Tune scheduler on GFX10 and GFX11
Abandoned · Public

Authored by rampitec on Jan 13 2023, 2:17 PM.

Details

Reviewers
kerbowa
foad
Summary

Unlike older ASICs, GFX10+ has a lot of VGPRs. Therefore, it is possible
to achieve high occupancy even with all or almost all addressable VGPRs
in use. Our scheduler was never tuned for this scenario: the VGPR critical
limit threshold always ends up very high, even when maximum occupancy is
targeted. For example, on gfx1100 it is set to 192 registers even with a
requested occupancy of 16. As a result, the scheduler starts prioritizing
register pressure reduction very late, and we easily end up spilling.

This patch makes scheduling on the new targets much closer to GFX9: the
VGPR critical limit is now based on the number of addressable registers
rather than on the total VGPR budget.
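
To make the effect concrete, here is a minimal, self-contained C++ sketch of how such a critical limit gates the scheduling heuristic. This is an illustration only, not the GCNSchedStrategy code; the type and function names (PressureState, pickCandidatePriority) are assumptions made up for the example.

```cpp
// Illustration only: how a VGPR critical limit gates the heuristic.
// Not the actual GCNSchedStrategy implementation.
struct PressureState {
  unsigned CurrentVGPRs;   // live VGPRs as seen by the RP tracker
  unsigned CriticalLimit;  // threshold above which RP reduction dominates
};

enum class Priority { Default, ReduceRegPressure };

Priority pickCandidatePriority(const PressureState &PS) {
  // Above the critical limit, reducing register pressure wins over the
  // usual latency/ILP heuristics; below it, normal scheduling applies.
  return PS.CurrentVGPRs > PS.CriticalLimit ? Priority::ReduceRegPressure
                                            : Priority::Default;
}
```

In the gfx1100 example above the threshold sits at 192, so a region can build up a great deal of pressure before the priority flips and the allocator is then left to spill; a limit derived from the addressable register count stays much lower at high requested occupancies and flips the priority early enough to avoid that.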

The intent of the patch is to have no impact on GFX9 and older targets;
a massive lit test update shows no changes on those targets.

Diff Detail

Event Timeline

rampitec created this revision. Jan 13 2023, 2:17 PM
Herald added a project: Restricted Project. · View Herald Transcript · Jan 13 2023, 2:17 PM
rampitec requested review of this revision. Jan 13 2023, 2:17 PM
Herald added a subscriber: wdng. · View Herald Transcript

I think this makes sense, but it also makes the concept of an occupancy target a misnomer.

Additionally, I worry that we may be over-prioritizing RP reduction when we consistently see cases where higher occupancy isn't necessarily leading to better performance.

This situation is basically why the HighRPReschedule stage exists: we target a higher occupancy to see whether occupancy was dropped because the heuristics did not focus on RP reduction soon enough. This HighRP stage will only increase the target occupancy by one and decrease the critical register limit by an additional 10. We could probably be more aggressive with these numbers in cases where we have spilling. For example, setting the HighRPErrorMargin to something very high like 200 also eliminates the spilling in the testcase for the relevant ticket (SWDEV-377300).

That may be a more targeted change, but if your feeling is that this patch will be a more general improvement, it looks good to me.
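
As a rough sketch of the retune described above (hypothetical names and structure, not the actual GCNSchedStrategy stage), the shape of the adjustment is roughly:

```cpp
#include <algorithm>

// Rough sketch of the HighRP retune; hypothetical names and values,
// not the actual GCNSchedStrategy stage.
struct SchedLimits {
  unsigned TargetOccupancy;
  unsigned VGPRCriticalLimit;
  unsigned SGPRCriticalLimit;
};

SchedLimits retuneForHighRP(SchedLimits L, bool RegionSpills) {
  // Aim one wave higher so the occupancy-derived limits get tighter.
  L.TargetOccupancy += 1;
  // Today the stage tightens the critical limits by an extra 10; the
  // suggestion above is to be far more aggressive (e.g. ~200) when the
  // region actually spills.
  unsigned ExtraMargin = RegionSpills ? 200 : 10;
  L.VGPRCriticalLimit -= std::min(ExtraMargin, L.VGPRCriticalLimit);
  L.SGPRCriticalLimit -= std::min(ExtraMargin, L.SGPRCriticalLimit);
  return L;
}
```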

Austin, thanks for the review. I agree this somewhat dilutes the concept of occupancy-based scheduling, although I do not think it completely misrepresents the pass now: it only changes a single threshold, while the others still use occupancy. This is more of a wording issue to me anyway, but I will certainly try to play with HighRPReschedule, mainly because I am a bit afraid of the amount of gfx10/gfx11 scheduling changes and their potential impact. If it is possible to tackle the high-RP issue only when we really hit spilling, and prioritize ILP otherwise, that would certainly be ideal.

Meanwhile, it would be interesting to see the overall impact of this patch as-is on graphics workloads on gfx10 (and possibly gfx11 too). @foad, do you mind helping with that? If it shows an overall improvement, we may prefer the current approach anyway. That said, the issue should affect wave64 kernels less than wave32.

> This situation is basically why the HighRPReschedule stage exists: we target a higher occupancy to see whether occupancy was dropped because the heuristics did not focus on RP reduction soon enough. This HighRP stage will only increase the target occupancy by one and decrease the critical register limit by an additional 10. We could probably be more aggressive with these numbers in cases where we have spilling. For example, setting the HighRPErrorMargin to something very high like 200 also eliminates the spilling in the testcase for the relevant ticket (SWDEV-377300).

In fact, ErrorMargin is completely misused by the HighRPReschedule stage. This number exists to account for RPTracker inaccuracy, not to bias the limits. In this particular case it certainly does not help to bias the SGPR limit by something like 200. That is, it is possible to do this in HighRPReschedule, but it needs better infrastructure.
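
To make the distinction concrete, here is a small sketch (hypothetical names, not the real scheduler code): an error margin compensates for tracking inaccuracy and applies uniformly to every limit, whereas deliberately tightening one limit to fight spilling would be a separate, targeted bias.

```cpp
#include <algorithm>

// Hypothetical sketch of the margin-vs-bias distinction; not the real code.
struct Limits {
  unsigned VGPRCriticalLimit;
  unsigned SGPRCriticalLimit;
};

Limits applyMarginAndBias(Limits L, unsigned ErrorMargin, unsigned VGPRBias) {
  // ErrorMargin: a small, uniform allowance for RPTracker inaccuracy.
  // VGPRBias: an explicit, targeted adjustment used to fight VGPR spilling.
  // Reusing ErrorMargin as the bias would drag the SGPR limit down by the
  // same large amount, which is not what is wanted.
  L.VGPRCriticalLimit -= std::min(ErrorMargin + VGPRBias, L.VGPRCriticalLimit);
  L.SGPRCriticalLimit -= std::min(ErrorMargin, L.SGPRCriticalLimit);
  return L;
}
```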

The other consideration is that doing this only in the regions with high RP may leave little room for improvement. In the testcase discussed, the live-in pressure is already high, and it will not be affected because it comes from a block with lower pressure. On the other hand, always doing this math keeps things more balanced.

Another consideration is that we do not really need to go unclustered here; we just need adequate thresholds.

There is a more conservative alternative which might be better: D141876. That one should not affect kernels which do not currently spill. I am keeping this revision open for the time being in case we decide it is better overall.

rampitec abandoned this revision. Jan 20 2023, 11:11 AM

Testing showed a negative performance impact. D141876 is the way to go.