This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Fix regression with vectorization limiting
ClosedPublic

Authored by rampitec on Mar 31 2022, 2:47 PM.

Details

Summary

D67148 has removed TTI::getNumberOfRegisters(bool Vector) and
started to call TTI::getNumberOfRegisters(unsigned ClassID) from
the LoopVectorize. This has resulted in an unrestricted vectorization
on AMDGPU blowing up register pressure.

Diff Detail

Event Timeline

rampitec created this revision.Mar 31 2022, 2:47 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 31 2022, 2:47 PM
rampitec requested review of this revision.Mar 31 2022, 2:47 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 31 2022, 2:47 PM
Herald added a subscriber: wdng. · View Herald Transcript
arsenm added inline comments.Mar 31 2022, 2:49 PM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
310

4 seems really small

rampitec added inline comments.Mar 31 2022, 2:51 PM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
310

It is enough to allow vectorization, all we need really. Giving more immediately explodes RP because of the interleaving. That can be possible to increase this, but then limit interleaving much more.

rampitec added inline comments.Mar 31 2022, 3:12 PM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
310

Here is the loop triggered the investigation:

for (int i = rowStart; i < rowEnd; i++) {
  gq += temp[i];
}

gs/temp are float. The whole kernel w/o loop-vectorize uses 9 VGPRs, with the vecotrizer as it is now 78. With this change it goes down to 38 which is still higher than wanted. If I allow 8 registers final budget is 78 VGPRs again, and to bring it back down to 38 I have to disable interleave. Even interleave factor of 2 plus 8 registers reported here results in 46 VGPRs.

rampitec updated this revision to Diff 419586.Mar 31 2022, 4:22 PM

I have realized that RCID passed into getNumberOfRegisters(unsigned RCID) is in fact not an RCID, but boolean for vector/scalar registers. We could implement getRegisterClassForType() to change that, but we cannot reasonably distinguish between VGPRs and SGPRs anyway. At the end result is clamped to just 4, so it is easier to remove all of these calculations and simply return 4.

Note that 4 was the most common return value before the regression. Optimistically we assume max occupancy which means 24 vgprs on most targets. Return value was max available vgprs / 8, i.e. 4. Exceptions were an implicit request of the maximum occupancy, a rare thing, and Navi with different vgpr to occupancy mappings.

This is a gross regression and I want more eyes on this. The only reason we didn't immediately spot it is because of not so much perf reports from gfx90a, where we have packed f32.

It seems to me Cost::RateFormula() form LSR is the only user that can be affected by the change. Would you please double-look the use case?

It seems to me Cost::RateFormula() form LSR is the only user that can be affected by the change. Would you please double-look the use case?

Yes, I believe unbound LSR also hits us with RP a lot. Limiting it to just 4 'free' pointers is a good thing IMO.

Ping. Performance testing was done on gfx90a for the tests of interest, change gives roughly 10% increase in all subtests.

dfukalov accepted this revision.Apr 8 2022, 5:00 PM
This revision is now accepted and ready to land.Apr 8 2022, 5:00 PM
This revision was landed with ongoing or failed builds.Apr 8 2022, 5:47 PM
This revision was automatically updated to reflect the committed changes.