- User Since
- Apr 26 2017, 9:47 AM (177 w, 2 d)
Mon, Sep 14
Mon, Sep 7
Tue, Sep 1
Does anyone still have concerns with this change?
Tue, Aug 25
You are correct that the threshold value is something of a magic number, but it is what the unroll pass uses to help decide whether to unroll a loop fully or partially. Even when using an unroll factor the amount of unrolling performed may change in response to changes which affect the loop size computation if that pushes the loop over the threshold.
Amended variable name
Mon, Aug 24
Changed the metadata from amdgpu.loop.unroll.threshold to llvm.loop.unroll.threshold
Aug 18 2020
Ignoring the explicit unroll hints (disable, full, count) for now, the loop unrolling is controlled by the threshold. Loops are unrolled. either fully or partially, up to the threshold.
Aug 10 2020
As I said in my reply to Matt, I made it target specific to avoid cluttering the llvm.loop metadata with something that is only used by one target. I am happy to make it llvm.loop.unroll.threshold if that is considered better.
Aug 7 2020
To answer Matt's question, there isn't currently metadata to set the threshold - there is a function attribute that I added that allows the default to be set which is similar, but that of course uses the same value for all loops within a function. However, that doesn't give the per-loop control we have since discovered we need for graphics applications.
Aug 6 2020
There is some more info about the motivation for this in https://github.com/GPUOpen-Drivers/llpc/pull/863
Jul 29 2020
Fixed case style of variable.
Jul 28 2020
Jun 24 2020
Jun 22 2020
Removed changes introduced only by clang-format format.
Jun 19 2020
Nov 21 2019
Nov 20 2019
Added test to confirm that the amdgpu-unroll-threshold attribute has the expected effect.
Nov 18 2019
Function attribute for loop unroll threshold default
Oct 17 2019
Changes to address review comments
Oct 11 2019
Oct 7 2019
Sep 23 2019
Aug 8 2019
Aug 5 2019
Does anyone have any remaining concerns with this as it is now?
Jul 29 2019
Simplified the LIT test
Jul 22 2019
Some cases can be undone by rematerialization, but not all, and it can involve a lot of effort which increases compile time. The metadata is a pragmatic approach which helps in some cases.
Changed the metadata name to llvm.licm.disable
Jul 15 2019
Jul 11 2019
Unfortunately I don't believe I have an example that is suitable for publishing. The problem is basically that the code motion is generally a good thing, but it can increase live ranges significantly pushing register pressure up such that performance is degraded, but that impact isn't apparent at the point at which the transformation is performed. Disabling the pass for certain loops is a low-impact pragmatic (but not ideal) solution.
Update LangRef doc
The target in this case is AMDGPU, and the problem is that transformations can create excessive register pressure ( but I don't believe this is necessarily unique to that target). The front-end has additional information available which can sometimes be used to identify such cases.
May 8 2019
Sorry not to have noticed this sooner - I was just about to make a fix myself. I chose to change the constructor to
Feb 1 2019
Jan 28 2019
Jan 24 2019
Extended llvm.amdgcn.interp.f16.ll to check that m0 is set before
each interp instruction if necessary, and added a new LIT test
to check that the interp f16 intrinsics are identified as being
Dec 18 2018
Rebased, and amended LIT test now that the required mode register
pass has been committed.
Dec 11 2018
I forgot to add the Phabricator Review to the commit - whoops!
Dec 10 2018
Nov 30 2018
Reordered the cases dealt with in Phase 1 so that the most specific
case (setreg instruction) is performed first, allowing the removal
of one condition, and reduced indentation for that case accordingly.
Removed redundant call to merge mode register status.
Nov 21 2018
Amended the declaration of NewInfo.
Fixed minor formatting issues, and amended the way mode changes are
combined into as few setreg instrcutions as possible.
Nov 12 2018
Amended SIModeRegister to address some minor points, and added comments to help explain why it appears more complex than necessary.
Refactored SIModeRegister.cpp slightly and added more comments to help explain the processing, and made a couple of minor changes to address review comments.
Nov 1 2018
I'm afraid I don't know anything about OpenCL non-default rounding modes - are they set per arithmetic operation or per function? When will these be needed?
Yes, I think that your suggestion is the correct solution in a perfect world. It is one of the possible approaches that we discussed in our team before implementing the current proposed solution.
Oct 31 2018
Fixes for observed failures:
- Corrected which instructions are marked as using the double
precision floating point rounding mode flags
- Changed the position where the first setreg in a block is
inserted in order to reduce the risk of hitting a hazard that
may exist at entry to the first block of a shader.
Aug 16 2018
Aug 13 2018
Minor amendments as per review comments.
Aug 1 2018
Updated the LIT test as per review comments.
Jul 24 2018
Removed the mode register pass, as that will be introduced as a
Jul 10 2018
Changed mode register pass to use an explicit stack instead of recursion.
Refactored pass to insert rounding mode to use a style more in line
with other LLVM passes. This fails to optimize a few corner cases,
but they are expected to occur very rarely if at all.
Jul 9 2018
A slighly more performant implementation of the pass to add any
required changes to the double precision rounding mode.
Jul 3 2018
[AMDGPU] Add intrinsics for 16 bit interpolation
May 22 2018
Change the omod operand type to be i32 rather than i1, to avoid
a build failure when building using a debug TableGen.
May 21 2018
Added a divergence LIT test for the 16 bit interp intrinsics.
May 18 2018
Or is this bit controlling the weird load from memory? The manual isn't particularly clear to me. I see mention of LDs loads, but also op_sel control of destination bits
Even without the high operand I don't think it is possible to overload interp_p1 and interp_p1_f16 as they would have identical types - there is nothing to disambiguate them.
Corrected the ordering of operands to interp_p2_f16, added lowered
intrinsics to list of those that cware a source of divergence, and
amended LIT test.
[AMDGPU] Add interpolation builtins