This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Amend target loop unroll defaults
ClosedPublic

Authored by timcorringham on Oct 11 2019, 9:16 AM.

Details

Summary

Amend the loop unroll thresholds for PAL shaders to be more aggressive.
This gives an overall performance benefit on a representative sample
of shaders.

Diff Detail

Event Timeline

timcorringham created this revision.Oct 11 2019, 9:16 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 11 2019, 9:16 AM
timcorringham added a reviewer: Restricted Project.Oct 11 2019, 9:21 AM

Could use a test

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
93–94

This would now be dead

101

These should probably be the same for all OSes

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h
52 ↗(On Diff #224612)

You don't need to add this field. You already have the subtarget available here, you just need to change the type

rampitec requested changes to this revision.Oct 11 2019, 11:05 AM
rampitec added a subscriber: rampitec.

How big was the performance testing?

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
62

This change penalizes loops which should have unroll boosted instead. Your new default thresholds are now higher than boosted.

This revision now requires changes to proceed.Oct 11 2019, 11:05 AM
nhaehnle added inline comments.Oct 14 2019, 9:36 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
62

I see now change here. Is something weird going on with the diff?

timcorringham marked 6 inline comments as done.

Changes to address review comments

timcorringham added inline comments.Oct 17 2019, 7:34 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
62

I now initialise ThresholdLocal to be the max of UnrollThresholdLocal and UP.Threshold., so the value used will only be increased for PAL.

93–94

This value is still used if the OS isn't PAL.

101

That is quite possible, but I don't have tests to confirm that for all OSes. Since the effect of crossing a cliff-edge can be significant (good or bad) I don't want to risk making that change for OSes without performance figures to justify it.

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h
52 ↗(On Diff #224612)

Good point. I didn't pay enough attention when I resolved the merge of my code - fixed.

I disagree to the idea of having different thresholds based on the runtime. A runtime has nothing to do with it. For example compute can work on top of ROCm or PAL. Can you justify different results for the same programs?

I understand that you have some code or codes which benefit from a specific threshold. I suggest you to analyze these codes, understand and explain the performance gain root cause, then create a new heuristic in this function. That is why this function exists in the first place. It will also allow you to create a testcase.

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
62

It still does not make sense. You are initializing general threshold higher (1100) than boosted (1000).

In addition it can be unrelated to the threshold at all. It may be a flaw in the cost model for specific instructions. Please also see D68881 which started to address cost model issues.

Function attribute for loop unroll threshold default

Changed approach for this in response to review comments. The new change is less invasive
and avoids OS/environment specific behavior in the target specific heuristics.
It allows the front-end to provide a minimum value for the loop unroll threshold on a
per-function basis, while still allowing the heuristics to adjust that threshold.

This looks acceptable, but needs test.

Added test to confirm that the amdgpu-unroll-threshold attribute has the expected effect.

This revision is now accepted and ready to land.Nov 20 2019, 11:41 AM
This revision was automatically updated to reflect the committed changes.