This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Reworked SIFixWWMLiveness
ClosedPublic

Authored by tpr on May 11 2018, 7:13 AM.

Details

Summary

I encountered some problems with SIFixWWMLiveness when WWM is in a loop:

  1. It sometimes gave invalid MIR where there is some control flow path to the new implicit use of a register on EXIT_WWM that does not pass through any def.
  1. There were lots of false positives of registers that needed to have an implicit use added to EXIT_WWM.
  1. Adding an implicit use to EXIT_WWM (and adding an implicit def just before the WWM code, which I tried in order to fix (1)) caused lots of the values to be spilled and reloaded unnecessarily.

This commit is a rework of SIFixWWMLiveness, with the following changes:

  1. Instead of considering any register with a def that can reach the WWM code and a def that can be reached from the WWM code, it now considers three specific cases that need to be handled.
  1. A register that needs liveness over WWM to be synthesized now has it done by adding itself as an implicit use to defs other than the dominant one.

Also added the following fixmes:

FIXME: We should detect whether a register in one of the above
categories is already live at the WWM code before deciding to add the
implicit uses to synthesize its liveness.

FIXME: I believe this whole scheme may be flawed due to the possibility
of the register allocator doing live interval splitting.

Change-Id: Ie7fba0ede0378849181df3f1a9a7a39ed1a94a94

Diff Detail

Repository
rL LLVM

Event Timeline

tpr created this revision.May 11 2018, 7:13 AM
tpr added reviewers: mareko, Restricted Project.May 17 2018, 12:51 PM
arsenm added inline comments.May 18 2018, 1:31 AM
lib/Target/AMDGPU/SIFixWWMLiveness.cpp
238 ↗(On Diff #146321)

Without looking at this too closely, I would like to avoid adding assumptions where the number of successors of a block is exactly 2 to avoid future pain when control flow lowering is changed. Is there a more specific property you can check instead?

tpr added inline comments.May 23 2018, 7:36 AM
lib/Target/AMDGPU/SIFixWWMLiveness.cpp
238 ↗(On Diff #146321)

It is checking for a very specific case. Currently I believe the structurizer will always generate if..then..endif as the only way of having a non-uniform phi in non-loop cases. This seems the best way to do it for now, although the whole thing is a hack that would possibly need core LLVM support to fix properly.

tpr added a comment.May 26 2018, 12:24 AM

Ping @arsenm

Is this ok?

Sorry I didn't get to this earlier, but would you mind holding off on this a little bit? I'd like to think this through.

@cwabbott Did you have a chance to look at this?

tpr added a comment.May 29 2018, 12:46 AM

OK will hold off for a bit.

I've had some time to let this sink in now.

Let me start by saying that I'm very skeptical of any approach that relies on LoopInfo or RegionInfo for correctness, and case distinctions in general.

That said, I do like the idea of getting rid of implicit uses on WWM, and instead turning defs of variables into partial defs. That fundamentally makes sense.

I also think your second FIXME is spot-on: if the register allocator does any kind of splitting / spilling inside WWM code, things are likely to get screwed up. I couldn't think of a way to fix that without going into the guts of the register allocator, so the following just ignores that problem entirely for now.

How about the following alternative logic for where partial defs are needed. Move this pass to before PHI elimination (but still after WQM), and consider all PHIs of vector registers. For every operand X of the PHI node consider its unique def D. If any of the defs of the other operands of the PHI node can reach[1] D via a WWM region, then add an appropriate implicit use to D. In the general case, this may require creating new PHI nodes to preserve SSA form, so perhaps the MachineSSAUpdater can be used for some of this (this would also help with adding IMPLICIT_DEFs where needed).

With this approach, we should be able to feel much more confident about the correctness of it all (except for the issue with register spills), and having fewer steps between PHI elimination and register allocation is always a good thing. It also preserves the nice property of your approach that we don't add excessive implicit uses.

If a normal graph walk is used to determine reachability in [1], then this is a somewhat conservative approximation in some uniform control flow cases, but I hope we can accept this as a first step. I haven't given much thought yet to how we could do better in general.

tpr added a comment.Jun 18 2018, 1:29 PM

Thanks, sounds feasible. I'll give it a go.

tpr added a comment.Jun 20 2018, 12:30 PM

Actually that plan does not work for one of the cases handled in the current change: when there is a def towards the end of a loop and a use outside the loop, and there is some WWM inside the loop. There is no phi node at the top of the loop at all for that, but we need to spot the case and introduce one (and make its value undef from the loop pre-header) because the liveness needs to go round the loop to allow for lanes that were in the loop but have become inactive because they have already decided to leave the loop.

tpr added a comment.Jun 26 2018, 12:10 AM

Ping @nhaehnle -- any comment on the flaw in your plan? Can we just go with this change as it is better than what is there now?

First, thanks for working on this issue, it's on our table since ages and I couldn't find a proper solution yet.
I tested your patch with my shader_ballot implementation on DOOM, but still get a bunch of artifacts (https://github.com/daniel-schuermann/mesa/commits/shader_ballot).
It might also be that my implementation is bugged, so I'd be glad in case you have AMDVLK branch with working

OpExtension "SPV_AMD_gcn_shader"
OpExtension "SPV_AMD_shader_ballot"
OpExtension "SPV_AMD_shader_trinary_minmax"
OpExtension "SPV_KHR_shader_ballot"
OpExtension "SPV_KHR_subgroup_vote"
OpCapability SubgroupBallotKHR
OpCapability SubgroupVoteKHR
OpCapability Group

if you let me know if DOOM works for you.

In D46756#1138135, @tpr wrote:

Actually that plan does not work for one of the cases handled in the current change: when there is a def towards the end of a loop and a use outside the loop, and there is some WWM inside the loop. There is no phi node at the top of the loop at all for that, but we need to spot the case and introduce one (and make its value undef from the loop pre-header) because the liveness needs to go round the loop to allow for lanes that were in the loop but have become inactive because they have already decided to leave the loop.

According to literature[1], there are exactly three types of phi-nodes:
γ - functions represent the joining point of different paths created by an “if-then-else” branch in the source program.
μ - functions, which only exist at loop headers, merge initial and loop-carried values.
η - functions represent values that leave a loop.

The case you are talking about requires an η phi-node after the loop exit, which should be there if we are in LCSSA-form. (If not, we'd have to lower to LCSSA.)
I think, case distinction makes sense here as these are the only variations of phi-nodes. For the η-node, we'd have to check if a Def of a phi-src can reach itself (as it's inside a loop) via a WWM region to resolve this issue.
The same might be true for the μ phi-nodes (e.g. if we have only one initial and one loop-carried value and afterwards a WWM region inside the loop, then neither the initial value reaches the loop-carried value via WWM nor the other way around).
The other cases according to @nhaehnle's proposal.

[1] Tu, Peng, and David Padua. "Efficient building and placing of gating functions." ACM SIGPLAN Notices 30.6 (1995): 47-55.

tpr added a comment.Jul 9 2018, 4:15 AM

Hi Daniel

Thanks for the comments. The LCSSA thing sounds good, except it does not overcome Nicolai's objection to relying on LoopInfo for semantic correctness.

I guess it might be possible to find the loop-exiting values as a def that reaches itself and has at least one use that does not reach the def.

Remember that all of this still has the fatal flaw that the register allocator may split a register and completely negate our attempts to add artificial liveness.

nhaehnle accepted this revision as: nhaehnle.Jul 31 2018, 5:34 AM

Agreed that LCSSA doesn't help either.

I think we have a consensus that the status quo is not ideal even with this patch (and even if my earlier suggestion could be made to work), but we also haven't made progress on how to fix this whole complex of issues properly. In the meantime, this patch does fix some bugs, so let's give it a shot.

This revision is now accepted and ready to land.Jul 31 2018, 5:34 AM
This revision was automatically updated to reflect the committed changes.