This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add new Mode Register pass
ClosedPublic

Authored by timcorringham on Aug 13 2018, 6:40 AM.

Details

Reviewers
arsenm
tpr
b-sumner
rampitec
nhaehnle
Group Reviewers
Restricted Project
Summary

A new pass to manage the Mode register.

Currently this just looks at the floating point double precision
rounding requirements, but is intended to be easily extended to
encompass all Mode register settings.

The floating point double precision rounding mode is required by
the 16 bit interpolation instructions.

Diff Detail

Event Timeline

timcorringham created this revision.Aug 13 2018, 6:40 AM
timcorringham added reviewers: arsenm, Restricted Project, tpr.Aug 13 2018, 6:44 AM
arsenm added inline comments.Aug 13 2018, 10:18 AM
lib/Target/AMDGPU/SIInstrFormats.td
124–126

It's not clear to me what this means, since every FP instruction does this

lib/Target/AMDGPU/SIModeRegister.cpp
161–181

I'm not really comfortable inserting something semantically required at this point. Can you do this when the instructions are selected instead?

172

What does this mean exactly by "needs"? Does the instruction fail to function?

225

nullptr

348–349

Since you seem to be relying on inserting these instructions, this is incorrect

368

Typo setregto

timcorringham marked 3 inline comments as done.Aug 13 2018, 12:20 PM
timcorringham added inline comments.
lib/Target/AMDGPU/SIInstrFormats.td
124–126

It is just a way of indicating to this pass that it is an instruction that uses the double precision flag.

lib/Target/AMDGPU/SIModeRegister.cpp
161–181

The problem with doing it during instruction selection is that we end up with many more mode register writes than are strictly required. As the mode register is not modelled as a register there isn't any way to track the values without a pass to do it, I suppose it is similar to adding nops or waitcnts, which are also done by specific passes.

172

The results can be outside the expected range when other rounding modes are used.

Minor amendments as per review comments.

arsenm added inline comments.Aug 15 2018, 2:08 PM
lib/Target/AMDGPU/SIModeRegister.cpp
161–181

This is a semantic property and I think it really belongs in instruction selection. What is the problem with optimizing those out here? What actually changes?

161–181

Or even earlier, a property of the emitted operation

timcorringham added inline comments.Aug 16 2018, 4:25 AM
lib/Target/AMDGPU/SIModeRegister.cpp
161–181

An initial attempt at this functionality did insert necessary mode register writes at instruction selection, but that resulted in many more changes than were necessary as the state of the register isn't known at that point. In order to avoid some of the changes all instructions that use the mode register would have to be updated to ensure the mode was appropriate - which was considered too invasive. A pass to remove unnecessary setregs would be possible, but would be very similar to this pass, but would still require changes to many instructions to insert the setregs, and would also provide extra overhead for all intervening passes. There would have to be some dependence introduced between the mode register and the instructions that use it to ensure any rescheduling didn't break the code.
Overall we thought that this approach was the best compromise. It solves the immediate problem with minimal overhead, and can be extended in a staged manner fairly easily.

Fixes for observed failures:

  • Corrected which instructions are marked as using the double

precision floating point rounding mode flags

  • Changed the position where the first setreg in a block is

inserted in order to reduce the risk of hitting a hazard that
may exist at entry to the first block of a shader.

rampitec requested changes to this revision.Oct 31 2018, 2:36 PM
rampitec added a subscriber: rampitec.

You need to model HWREG and add it as impuse to affected instructions and as imdef/out of a setreg.
That is the only correct way to protect it from rescheduling.

That way you will be able to emit setregs early and let some optimizations happen, including those to minimize setreg calls (since setreg is a big performance hit).

Moreover, you may need to split hwreg into several "registers" to track dependencies on individual bits.

This revision now requires changes to proceed.Oct 31 2018, 2:36 PM

Yes, I think that your suggestion is the correct solution in a perfect world. It is one of the possible approaches that we discussed in our team before implementing the current proposed solution.

The specific issue we are trying to solve is that the three 16 bit interpolation instructions need a non-default rounding mode. These are not yet widely used, but of course we need to ensure they work when they are used.

Modelling the Mode register as a separate register for each field would allow LLVM to track the values and minimise the number of changes required, and having a dependency to that register would avoid any issues with scheduling. To be complete we would need to add something like 14 separate registers corresponding to the fields within the Mode register, and add the dependencies to those to all the instructions that depend on the settings (lots). We would also need a pass to combine changes to separate fields into a single setreg wherever possible that would probably be something similar to the pass we have now. This feels like a rather invasive set of changes. This approach would have the advantage that it would probably also resolve the concerns Matt raised. However, we chose not to adopt this approach as we considered the cost-benefit equation to be too heavy on the cost side.

The approach we have implemented is a compromise that meets our current needs, is extendable for other mode settings should that become necessary, and isn't too invasive. It produces a minimal number of setreg instructions in almost all cases. Running the pass late avoids scheduling issues, but does possibly miss some minor optimization opportunities. However, given the rare occurrence of non-default modes the impact is very small.

Do you think the benefits of the multi-register approach justify the effort required over the current approach?

One thing we've wanted for compute for quite a while now is a way to request non-default-rounded add, sub, mul, div, fma, and sqrt. Assuming we ever figure out how to represent these in the IR, ideally without falling back on intrinsics, could this approach be used to implement and minimize the mode changes for those as well?

Yes, I think that your suggestion is the correct solution in a perfect world. It is one of the possible approaches that we discussed in our team before implementing the current proposed solution.

The specific issue we are trying to solve is that the three 16 bit interpolation instructions need a non-default rounding mode. These are not yet widely used, but of course we need to ensure they work when they are used.

Modelling the Mode register as a separate register for each field would allow LLVM to track the values and minimise the number of changes required, and having a dependency to that register would avoid any issues with scheduling. To be complete we would need to add something like 14 separate registers corresponding to the fields within the Mode register, and add the dependencies to those to all the instructions that depend on the settings (lots). We would also need a pass to combine changes to separate fields into a single setreg wherever possible that would probably be something similar to the pass we have now. This feels like a rather invasive set of changes. This approach would have the advantage that it would probably also resolve the concerns Matt raised. However, we chose not to adopt this approach as we considered the cost-benefit equation to be too heavy on the cost side.

The approach we have implemented is a compromise that meets our current needs, is extendable for other mode settings should that become necessary, and isn't too invasive. It produces a minimal number of setreg instructions in almost all cases. Running the pass late avoids scheduling issues, but does possibly miss some minor optimization opportunities. However, given the rare occurrence of non-default modes the impact is very small.

Do you think the benefits of the multi-register approach justify the effort required over the current approach?

That is not only few interpolation instructions which need it. We need to implement OpenCL non-default rounding modes for arithmetic instructions. That will require the use of setregs.
In fact I do not see a non-invasive or efficient way to implement OpenCL rounding modes without proper modeling of HWREG and dependencies, because lowering of intrinsics must occur early.
That means we will need to revert any late approach if submitted and reimplement it any way. I.e. I believe this effort is perfectly justified.

I'm afraid I don't know anything about OpenCL non-default rounding modes - are they set per arithmetic operation or per function? When will these be needed?

I'm afraid I don't know anything about OpenCL non-default rounding modes - are they set per arithmetic operation or per function? When will these be needed?

These are set per operation. For example one could use a builtin like convert_int4_rte() or convert_float4_rtp() to perform a conversion with a non-default rounding mode.
And yes, we need them.

b-sumner added a comment.EditedNov 1 2018, 12:40 PM

Actually the conversions don't need non-default-rounded operations, nor are non-default-rounded arithmetic operations required by OpenCL. However, we've had requests to implement functions such as add_rtz(x,y) which computes x+y with round-to-zero rounding. Our competitors offer such functions, and we implemented them for HSAIL. So we are really trying to get back to parity with HSAIL.

As an overall algorithmic remark: I like the organization of the pass into phases, because it provides a path forward to an additional optimization.

The current approach could be described as a straightforward fixed-point iteration over a lattice describing modes at each instruction that uses a mode. This is good and a perfectly fine approach.

However, consider the case of a loop that consistently uses a specific mode setting inside the loop, while code before the loop uses a different mode setting. The current code will introduce an s_setreg_imm32_b32 inside the loop, since the predecessor state at the beginning of the loop header is unknown. Instead, we could add an s_setreg_imm32_b32 at the end of the loop predecessor block.

The organization into separate passes allows a later modification of pass 2 & 3 into something more advanced that can take this possibility into account. I'm not saying that this optimization should necessarily be done with this change, but I just want people to be aware of it; also, it serves as an additional explanation for one of my inline comments :)

lib/Target/AMDGPU/SIModeRegister.cpp
91–93

Technically not required on entry, but required at the FirstInsertionPoint.

120

Use smart pointers (unique_ptr should do the trick).

161–181

I tend to agree with Tim here. If we first emit setregs everywhere just to remove most of them later, the pass that removes them will look pretty much identical to this pass, while in the meantime slowing down code generation elsewhere, and possibly even pessimizing things (e.g. restricting machine scheduling).

238–258

I feel like this logic is more convoluted than necessary, and possibly even wrong / overly conservative in some cases because of that.

For example, why are you setting InsertionPoint also in the case where FirstInsertionPoint is set? Why are IPChange and NewInfo->Change never cleared? My intuition is that the logic should just be a delayed-update/insert pattern like this:

if (!NewInfo->FirstInsertionPoint) {
  NewInfo->FirstInsertionPoint = &MI;
} else if (!IPChange.isCompatible(InstrMode)) {
  if (InsertionPoint)
    insertSetReg(IPChange);
  else
    NewInfo->Require = IPChange;
  merge IPChange into NewInfo->Change and reset IPChange
  InsertionPoint = &MI;
}
merge InstrMode into IPChange

Then at the end of the basic block:

if (!InsertionPoint)
  NewInfo->Require = IPChange;

The RequirePending flag should simply be unnecessary.

273–277

It would be more intuitive to guard this by IPChange being "non-empty". In the InsertionPoint case, insert the SETREG; otherwise, set NewInfo->Require. In both cases, IPChange should be reset.

The overall logic can then be described as:

  • NewInfo->Change describes the current status of the mode registers as we know it
  • IPChange describes the pending mode changes that need to be applied at InsertionPoint (if non-null) or NewInfo->Require (otherwise)
297

Remove this from here, it isn't conceptually part of phase 1. Initialize that list as part of external driver code.

rampitec resigned from this revision.Nov 5 2018, 12:05 PM

OK, I think the patch does not affect a future implementation. I still do believe we need to lower it early for the compute purposes, but it can be done later.
Temporarily resigning to remove the vote.

rampitec removed reviewers: Restricted Project, rampitec.Nov 5 2018, 12:16 PM
rampitec added reviewers: Restricted Project, rampitec.
timcorringham marked 3 inline comments as done.

Refactored SIModeRegister.cpp slightly and added more comments to help explain the processing, and made a couple of minor changes to address review comments.

Amended SIModeRegister to address some minor points, and added comments to help explain why it appears more complex than necessary.

lib/Target/AMDGPU/SIModeRegister.cpp
91–93

Yes, I forgot to change this comment when I changed the code to insert at the FirstInsertionPoint rather than at the start of the block.

238–258

There are cases where we don't set FirstInsertionPoint, even if there are other InsertionPoints. This can arise where an explicit setreg appears before the first instruction that uses the mode register. We preserve the setreg (we don't really expect to see any, but if they appear we assume there is a good reason for it). In that case there is no initial mode value requirement, so no FirstInsertionPoint. That is also the reason for the RequirePending flag - there are more states than can be deduced by just the InsertionPoint and FirstInsertionPoint pointers.
We don't clear Change as the algorithm assumes it holds the net change to the mode by the block. When we know the predecessor mode(s) in Phase 2 we can then determine the output mode of each block (this can involve revisiting blocks that are successors to any block that changes its output mode).
Phase 3 then determines whether a setreg is required at the FirstInsertionPoiint.

273–277

The status values aren't designed to work quite the way you assume. I have refactored the code slightly and improved the comments - does that help at all?

The update seems to have messed up the indentation of comments in a few places.

lib/Target/AMDGPU/SIModeRegister.cpp
227

auto NewInfo = llvm::make_unique<BlockData>();

238–258

Okay, I see the point about having a pre-existing setreg before any instruction with mode requirements.

However, the point about clearing Change (or rather IPChange) still stands, because there are many different mode bits that could have requirements separately. For example, you could have:

1. Inst that has f32 round & denormal requirements
2. Inst that has f16 round & denormal requirements
3. Inst that has different f32 round & denormal req.s
4. Inst that has different f16 round & denormal req.s

You really only need to insert two setregs (before 1 and 3), but the algorithm will insert setregs before 1, 3, and 4.

Fixed minor formatting issues, and amended the way mode changes are
combined into as few setreg instrcutions as possible.

timcorringham marked an inline comment as done.

Amended the declaration of NewInfo.

Thank you. One question left though.

lib/Target/AMDGPU/SIModeRegister.cpp
244–247

Hasn't this become redundant now?

nhaehnle added inline comments.Nov 29 2018, 3:34 AM
lib/Target/AMDGPU/SIModeRegister.cpp
244–247

Okay, no, I see how it isn't redundant if there's a pre-existing s_setreg.

But why the call to merge? isCompatible only returns true if InstrMode is a subset of the known bits of NewInfo->Change.

Removed redundant call to merge mode register status.

timcorringham marked 2 inline comments as done.Nov 30 2018, 4:21 AM
timcorringham added inline comments.
lib/Target/AMDGPU/SIModeRegister.cpp
244–247

Good spot- this is now redundant. It is benign but unnecessary - I'll remove it.

One last thing. If I'm right about this, it'd be good to reduce the check complexity and indentation levels. Then I'm happy :)

lib/Target/AMDGPU/SIModeRegister.cpp
244–247

Nice. Isn't the InstrMode != Status() check now also redundant, actually?

timcorringham marked an inline comment as done.

Reordered the cases dealt with in Phase 1 so that the most specific
case (setreg instruction) is performed first, allowing the removal
of one condition, and reduced indentation for that case accordingly.

timcorringham marked an inline comment as done.Nov 30 2018, 8:34 AM
nhaehnle accepted this revision.Dec 7 2018, 7:44 AM

Oops, I thought I'd posted this earlier... LGTM.

This revision is now accepted and ready to land.Dec 7 2018, 7:44 AM
timcorringham closed this revision.Dec 11 2018, 3:11 AM

I forgot to add the Phabricator Review to the commit - whoops!

Commited by:

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@348754 91177308-0d34-0410-b5e6-96231b3b80d8

and subsequent minor fix for a sanitizer issue:

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@348767 91177308-0d34-0410-b5e6-96231b3b80d8