- User Since
- Apr 26 2017, 9:47 AM (116 w, 1 d)
Mon, Jul 15
Thu, Jul 11
Unfortunately I don't believe I have an example that is suitable for publishing. The code motion is generally a good thing, but it can significantly lengthen live ranges, which pushes register pressure up far enough to degrade performance. That impact isn't apparent at the point at which the transformation is performed. Disabling the pass for certain loops is a low-impact, pragmatic (but not ideal) solution.
Update LangRef doc
The target in this case is AMDGPU, and the problem is that transformations can create excessive register pressure (but I don't believe this is necessarily unique to that target). The front-end has additional information available which can sometimes be used to identify such cases.
May 8 2019
Sorry not to have noticed this sooner - I was just about to make a fix myself. I chose to change the constructor to
Feb 1 2019
Jan 28 2019
Jan 24 2019
Extended llvm.amdgcn.interp.f16.ll to check that m0 is set before
each interp instruction if necessary, and added a new LIT test
to check that the interp f16 intrinsics are identified as being
Dec 18 2018
Rebased, and amended LIT test now that the required mode register
pass has been committed.
Dec 11 2018
I forgot to add the Phabricator Review to the commit - whoops!
Dec 10 2018
Nov 30 2018
Reordered the cases dealt with in Phase 1 so that the most specific
case (setreg instruction) is performed first, allowing the removal
of one condition, and reduced indentation for that case accordingly.
Removed redundant call to merge mode register status.
Nov 21 2018
Amended the declaration of NewInfo.
Fixed minor formatting issues, and amended the way mode changes are
combined into as few setreg instructions as possible.
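As a rough illustration of combining mode changes, the sketch below folds two pending changes into one so that a single register write could cover both. The `ModeChange` struct and `combine` function are hypothetical names chosen for this example and are not taken from the patch; the actual pass may track its state differently.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical record of pending mode-register changes (not from the patch):
// Mask marks which bits have a required value, Mode holds those values.
struct ModeChange {
  std::uint32_t Mask;
  std::uint32_t Mode;
};

// Fold a later change into an earlier one so both can be applied by a single
// write. Where the masks overlap, the later value wins.
ModeChange combine(ModeChange Earlier, ModeChange Later) {
  assert((Later.Mode & ~Later.Mask) == 0 && "mode bits set outside the mask");
  ModeChange Result;
  Result.Mask = Earlier.Mask | Later.Mask;
  Result.Mode = (Earlier.Mode & Earlier.Mask & ~Later.Mask) |
                (Later.Mode & Later.Mask);
  return Result;
}
```

Under these assumptions, two changes that touch disjoint bit fields merge into one change covering both fields, and overlapping changes keep only the most recent value.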
Nov 12 2018
Amended SIModeRegister to address some minor points, and added comments to help explain why it appears more complex than necessary.
Refactored SIModeRegister.cpp slightly and added more comments to help explain the processing, and made a couple of minor changes to address review comments.
Nov 1 2018
I'm afraid I don't know anything about OpenCL non-default rounding modes - are they set per arithmetic operation or per function? When will these be needed?
Yes, I think that your suggestion is the correct solution in a perfect world. It is one of the possible approaches that we discussed in our team before implementing the current proposed solution.
Oct 31 2018
Fixes for observed failures:
- Corrected which instructions are marked as using the double
precision floating point rounding mode flags
- Changed the position where the first setreg in a block is
inserted in order to reduce the risk of hitting a hazard that
may exist at entry to the first block of a shader.
Aug 16 2018
Aug 13 2018
Minor amendments as per review comments.
Aug 1 2018
Updated the LIT test as per review comments.
Jul 24 2018
Removed the mode register pass, as that will be introduced as a
Jul 10 2018
Changed mode register pass to use an explicit stack instead of recursion.
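For readers unfamiliar with the technique, here is a minimal sketch of walking a control-flow graph with an explicit stack rather than recursion. The `Graph` alias and `visitBlocks` function are hypothetical and only illustrate the general idea; they are not the code from the pass.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical control-flow graph: Succs[B] lists the successors of block B.
using Graph = std::vector<std::vector<std::size_t>>;

// Visit every block reachable from Entry in depth-first pre-order. An
// explicit worklist replaces recursion, so a deep graph cannot overflow the
// call stack.
std::vector<std::size_t> visitBlocks(const Graph &Succs, std::size_t Entry) {
  assert(Entry < Succs.size() && "entry block out of range");
  std::vector<bool> Visited(Succs.size(), false);
  std::vector<std::size_t> Stack{Entry};
  std::vector<std::size_t> Order;
  while (!Stack.empty()) {
    std::size_t B = Stack.back();
    Stack.pop_back();
    if (Visited[B])
      continue; // Already handled via another predecessor.
    Visited[B] = true;
    Order.push_back(B);
    // Push successors in reverse so the first successor is visited first.
    for (auto It = Succs[B].rbegin(); It != Succs[B].rend(); ++It)
      if (!Visited[*It])
        Stack.push_back(*It);
  }
  return Order;
}
```

The visited set also makes the walk terminate on loops, which a naive recursive version would need to handle separately.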
Refactored pass to insert rounding mode to use a style more in line
with other LLVM passes. This fails to optimize a few corner cases,
but they are expected to occur very rarely if at all.
Jul 9 2018
A slightly more performant implementation of the pass to add any
required changes to the double precision rounding mode.
Jul 3 2018
[AMDGPU] Add intrinsics for 16 bit interpolation
May 22 2018
Change the omod operand type to be i32 rather than i1, to avoid
a build failure when building using a debug TableGen.
May 21 2018
Added a divergence LIT test for the 16 bit interp intrinsics.
May 18 2018
Or is this bit controlling the weird load from memory? The manual isn't particularly clear to me. I see mention of LDS loads, but also op_sel control of destination bits.
Even without the high operand I don't think it is possible to overload interp_p1 and interp_p1_f16 as they would have identical types - there is nothing to disambiguate them.
Corrected the ordering of operands to interp_p2_f16, added lowered
intrinsics to the list of those that are a source of divergence, and
amended LIT test.
[AMDGPU] Add interpolation builtins
May 15 2018
May 11 2018
Apr 4 2018
Mar 26 2018
Added support for disassembly of arbitrary size sections
Mar 21 2018
Mar 20 2018
This implements Bug 36347
Feb 21 2018
Looks good to me.
Feb 5 2018
Matt, we do actually need these intrinsics as we have an urgent requirement for them in Open Vulkan (which is of course my motivation for implementing them).
Feb 3 2018
Dec 4 2017
Nov 28 2017
May 25 2017
May 18 2017
Updated s_getpc instruction definition to include intrinsic.
May 17 2017
Amendments addressing review comments:
May 15 2017
[AMDGPU] add intrinsic for s_getpc
May 8 2017
Even then, relying on the high 32 bits is a pretty big assumption. What are you doing with the address? This might be better served by something more targeted.