D67148 has removed TTI::getNumberOfRegisters(bool Vector) and
started to call TTI::getNumberOfRegisters(unsigned ClassID) from
the LoopVectorize. This has resulted in an unrestricted vectorization
on AMDGPU blowing up register pressure.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
310 | 4 seems really small |
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
310 | It is enough to allow vectorization, all we need really. Giving more immediately explodes RP because of the interleaving. That can be possible to increase this, but then limit interleaving much more. |
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp | ||
---|---|---|
310 | Here is the loop triggered the investigation: for (int i = rowStart; i < rowEnd; i++) { gq += temp[i]; } gs/temp are float. The whole kernel w/o loop-vectorize uses 9 VGPRs, with the vecotrizer as it is now 78. With this change it goes down to 38 which is still higher than wanted. If I allow 8 registers final budget is 78 VGPRs again, and to bring it back down to 38 I have to disable interleave. Even interleave factor of 2 plus 8 registers reported here results in 46 VGPRs. |
I have realized that RCID passed into getNumberOfRegisters(unsigned RCID) is in fact not an RCID, but boolean for vector/scalar registers. We could implement getRegisterClassForType() to change that, but we cannot reasonably distinguish between VGPRs and SGPRs anyway. At the end result is clamped to just 4, so it is easier to remove all of these calculations and simply return 4.
Note that 4 was the most common return value before the regression. Optimistically we assume max occupancy which means 24 vgprs on most targets. Return value was max available vgprs / 8, i.e. 4. Exceptions were an implicit request of the maximum occupancy, a rare thing, and Navi with different vgpr to occupancy mappings.
This is a gross regression and I want more eyes on this. The only reason we didn't immediately spot it is because of not so much perf reports from gfx90a, where we have packed f32.
It seems to me Cost::RateFormula() form LSR is the only user that can be affected by the change. Would you please double-look the use case?
Yes, I believe unbound LSR also hits us with RP a lot. Limiting it to just 4 'free' pointers is a good thing IMO.
Ping. Performance testing was done on gfx90a for the tests of interest, change gives roughly 10% increase in all subtests.
4 seems really small