Amend the loop unroll thresholds for PAL shaders to be more aggressive.
This gives an overall performance benefit on a representative sample
of shaders.
Details
- Reviewers: rampitec
- Group Reviewers: Restricted Project
- Commits: rG6821a3ccd69f: [AMDGPU] Add attribute for target loop unroll threshold default

Diff Detail
- Repository: rG LLVM Github Monorepo
Event Timeline
Could use a test
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
- Line 93–94: This would now be dead.
- Line 101: These should probably be the same for all OSes.

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h
- Line 52 (On Diff #224612): You don't need to add this field. You already have the subtarget available here; you just need to change the type.
How big was the performance testing?
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
- Line 62: This change penalizes loops that should get the boosted unroll threshold instead. Your new default thresholds are now higher than the boosted ones.
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
- Line 62: I see no change here. Is something weird going on with the diff?
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
- Line 62: I now initialise ThresholdLocal to the max of UnrollThresholdLocal and UP.Threshold, so the value used will only be increased for PAL.
- Line 93–94: This value is still used if the OS isn't PAL.
- Line 101: That is quite possible, but I don't have tests to confirm it for all OSes. Since the effect of crossing a cliff edge can be significant (good or bad), I don't want to risk making that change for OSes without performance figures to justify it.

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h
- Line 52 (On Diff #224612): Good point. I didn't pay enough attention when I resolved the merge of my code; fixed.
I disagree with the idea of having different thresholds based on the runtime. The runtime has nothing to do with it: for example, compute can run on top of either ROCm or PAL. Can you justify different results for the same programs?
I understand that you have some code which benefits from a specific threshold. I suggest you analyze that code, understand and explain the root cause of the performance gain, and then create a new heuristic in this function. That is why this function exists in the first place. It will also allow you to create a test case.
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
- Line 62: It still does not make sense. You are initializing the general threshold higher (1100) than the boosted one (1000).
In addition, it may be unrelated to the threshold at all; it could be a flaw in the cost model for specific instructions. Please also see D68881, which started to address cost model issues.
Function attribute for loop unroll threshold default
Changed the approach for this in response to review comments. The new change is less invasive
and avoids OS/environment-specific behavior in the target-specific heuristics.
It allows the front-end to provide a minimum value for the loop unroll threshold on a
per-function basis, while still allowing the heuristics to adjust that threshold.
Added a test to confirm that the amdgpu-unroll-threshold attribute has the expected effect.