This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add amdgpu specific loop threshold metadata
ClosedPublic

Authored by timcorringham on Jul 28 2020, 11:23 AM.

Details

Summary

Added custom loop metadata amdgpu.loop.unroll.threshold metadata to allow the
initial target specific threshold value to be specified on a loop by loop basis.
The intention is to be able to to allow more nuanced hints, e.g. specifying a
low threshold value rather than using llvm.loop.unroll.disable.

Diff Detail

Event Timeline

timcorringham created this revision.Jul 28 2020, 11:23 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 28 2020, 11:23 AM
timcorringham requested review of this revision.Jul 28 2020, 11:23 AM

Fixed case style of variable.

Do you have a pointer to how this is planned to be used in the frontend? Why is the per-function attribute not enough?

There is some more info about the motivation for this in https://github.com/GPUOpen-Drivers/llpc/pull/863

Basically it gives the front end more flexibility in how it treats unroll hints. It has been shown that a lot of user supplied hints are not optimal for amdgpu if obeyed strictly. Hopefully this is generic enough to be used in other as yet unforeseen ways.

Isn't there generic metadata for this already? I find the existing large set of loop metadata confusing as it is

This is now patch number 3 or so that want's to introduce duplication of existing functionality for AMDGPU. This is an alarming trend. @gregrodgers

https://llvm.org/docs/LangRef.html#llvm-loop-unroll

jdoerfert requested changes to this revision.Aug 6 2020, 11:43 AM
This revision now requires changes to proceed.Aug 6 2020, 11:43 AM

To answer Matt's question, there isn't currently metadata to set the threshold - there is a function attribute that I added that allows the default to be set which is similar, but that of course uses the same value for all loops within a function. However, that doesn't give the per-loop control we have since discovered we need for graphics applications.

For Johannes concerns, I could undertake to remove the function attribute in a future change if that helps? I would have to modify our front-end to use this new metadata rather than the function attribute first, but then the function attribute could be removed from LLVM (although that is a single line of code). As the new metadata is amdgpu specific it has no impact on the generic loop metadata, and is only used by AMDGPUTTIImpl::getUnrollingPreferences().

arsenm added inline comments.Aug 7 2020, 2:16 PM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
124

I have 2 concerns about this. First there's nothing AMDGPU specific about this. Second, the unroll threshold interpretation isn't really a stable concept. What happens if we adjust the cost model and now all the programs using this metadata change?

Why is directly setting the threshold this way useful compared to setting a target unroll factor?

My concern is that we have a way to annotate loops with unroll specific metadata, and this introduces an AMDGPU specific way to do something that we want generically (as evident by the function attribute).
So, if you need per-loop control, why not introduce ‘llvm.loop.unroll.threshold? I mean, at the end of the day, what is AMDGPU specific about this?

timcorringham added inline comments.Aug 10 2020, 5:21 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
124

You are correct that this isn't intrinsically AMDGPU specific, although it is something that so far hasn't been required by any other target. As it is used by the target specific unroll options rather than in generic code it isn't actually useful for any other target unless they are changed to use it, so I made it target specific in order to avoid cluttering the generic llvm.loop metadata with something that is only used by one target. I'm happy to make it llvm.loop.unroll.threshold if that is acceptable to everyone.

A target unroll factor can be better (and that is still available) - although it requires some analysis by the front end to determine what value is appropriate. Our front end used to do this, which didn't always work out well, and also increased compile time significantly. Increasing the default threshold (using the function attribute) produced better overall results. We have now determined that we benefit from unrolling some DontUnroll loops, but that the amount depends on several factors. Again, that either need some front end analysis, or a low threshold for those loops - which gives a worthwhile gain overall for little compile time effort.

When the unroll factor isn't forced (unroll.disable, unroll.full, or unroll.count) then any cost model changes (or any other change that affects the threshold or loop size) can result in a change to the unroll behavior. That can be appropriate, just as a fixed unroll factor may not be with the same changes. I don't think there is a clear correct answer, just a compromise.

As I said in my reply to Matt, I made it target specific to avoid cluttering the llvm.loop metadata with something that is only used by one target. I am happy to make it llvm.loop.unroll.threshold if that is considered better.

However, I'm not really sure how generic it is as no other target has felt the need to do this (at least publicly). The function attribute is also amdgpu specific. Neither are used in any generic code, only in the target specific getUnrollingPreferences().

Is the consensus to change it to llvm.llop.unroll.threshold?

As I said in my reply to Matt, I made it target specific to avoid cluttering the llvm.loop metadata with something that is only used by one target. I am happy to make it llvm.loop.unroll.threshold if that is considered better.

However, I'm not really sure how generic it is as no other target has felt the need to do this (at least publicly). The function attribute is also amdgpu specific. Neither are used in any generic code, only in the target specific getUnrollingPreferences().

Is the consensus to change it to llvm.llop.unroll.threshold?

Yes, but first I think we need to back up and establish why directly putting arbitrary thresholds in the IR is useful in the first place

Ignoring the explicit unroll hints (disable, full, count) for now, the loop unrolling is controlled by the threshold. Loops are unrolled. either fully or partially, up to the threshold.

The threshold is intended to allow loops to be unrolled in cases where there is likely to be a performance gain, but to avoid unrolling loops to the extent that the performance is affected by the code size (or in our case register pressure - which threshold doesn't directly relate to). So the default threshold value is a compromise which is set such that some gains are seen with typical code, but avoiding the bad cases. The default threshold value is adjusted by heuristics which increase it in cases where a higher value is likely to be beneficial. Currently the default threshold is 150 for compute, but 700 for graphics (set using the function attribute). The value of 700 was determined to be the best compromise by testing a (large) representative sample of graphics apps. It isn't the best value for a lot of shaders, but avoids the worst effects of excessive register pressure. For some specific applications different default threshold values are passed in (again using the function attribute).

Returning to the explicit unroll hints, it has been observed that for a lot of graphics applications the hints are not optimal for amdgpu. In particular a number of applications benefit (significantly) from unrolling some of the loops that have unroll.disable hints. However, allowing all of them to be unrolled results in register pressure in some cases, Unrolling small loops achieves most of the gains, while avoiding the bad cases - that can be achieved by using a lower threshold for such loops (250 seems to be a good compromise value).

There have been several attempts at performing analysis in the font-end to set the unroll count as a more precise way of controlling the amount of unrolling. In the general case these have not worked as well as simply using the threshold, and increased compile time significantly.

So that brings us back to this change, which provides a simple and cheap mechanism to specify a different threshold for each loop. There is no impact on anything that doesn't use the threshold metadata, but it offers some flexibility controlled by the front-end in cases where it is beneficial. It isn't a perfect solution, just a pragmatic feature.

Changed the metadata from amdgpu.loop.unroll.threshold to llvm.loop.unroll.threshold

Also added a description of this metadata to LangRef.rst

When the unroll factor isn't forced (unroll.disable, unroll.full, or unroll.count) then any cost model changes (or any other change that affects the threshold or loop size) can result in a change to the unroll behavior. That can be appropriate, just as a fixed unroll factor may not be with the same changes. I don't think there is a clear correct answer, just a compromise.

I guess the question is, what does a threshold of X mean? It seems to be an intrinsically magic number that is only meaningful for a particular version of the compiler (as @arsenm pointed out before). I mean, if anything changes, in the IR, the loop threshold computation, the input, ... the behavior can flip from unrolling to not unrolling. Most of these things are out of the control for the user, that is, the user actually not any better of than before by specifying a threshold of X instead of a unroll factor of Y, is she? Maybe I misunderstand the use case here, could you elaborate who/when the metadata is generated and what it is influenced by?

Also, if we conclude this is useful, it should be useful for the regular unroller, right? So we should check for the metadata there too I guess.

Amended variable name

You are correct that the threshold value is something of a magic number, but it is what the unroll pass uses to help decide whether to unroll a loop fully or partially. Even when using an unroll factor the amount of unrolling performed may change in response to changes which affect the loop size computation if that pushes the loop over the threshold.

The threshold used by the loop unroll pass is set by the target specific loop preferences, which can use heuristics to adjust it. The llvm.loop.unroll.threshold is used by the amdgpu target loop preferences as the starting value before heuristics are applied, but it would be possible for the value to be used in other ways by other targets.

The motivation for this is our vulkan front end which can have some insights into the workload that are not apparent from the IR alone, so the metadata is a way to use that information to help guide the loop unroll process. It may be that the value passed will be changed over time in response to test results if they indicate that other changes have affected the unrolling decisions, but that is no different to a lot of other settings.

Does anyone still have concerns with this change?

dstuttard accepted this revision.Sep 23 2020, 3:56 AM

This change looks good to me, especially now that you've made the requested changes.

jdoerfert resigned from this revision.Sep 23 2020, 10:13 AM
This revision is now accepted and ready to land.Sep 23 2020, 10:13 AM
arsenm requested changes to this revision.Sep 23 2020, 10:42 AM

With the current subject, I don't think this would have received the appropriate attention. I think it should be reposted with a more generic subject for people more familiar with loop metadata

This revision now requires changes to proceed.Sep 23 2020, 10:42 AM

OK I did create a new review, https://reviews.llvm.org/D88215, with the proposed metadata made generic. The result of that review is that it should not be generic.

Are there still objections to the amdgpu specific metadata?

arsenm added inline comments.Oct 21 2020, 8:03 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
124

Should the name re-gain the amdgpu part?

Changed the metadata back to the AMDGPU specific amdgpu.loop.unroll.threshold

arsenm accepted this revision.Oct 22 2020, 7:00 AM
This revision is now accepted and ready to land.Oct 22 2020, 7:00 AM
This revision was landed with ongoing or failed builds.Oct 22 2020, 9:23 AM
This revision was automatically updated to reflect the committed changes.