This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU. Divergence driven ISel. Assign register class for cross block values according to the divergence.
ClosedPublic

Authored by alex-t on Mar 29 2019, 6:40 AM.

Details

Summary

To make instruction selection really divergence driven it is necessary to assign the correct register classes to the cross block values beforehand.
For the divergent targets same value type requires different register classes dependent on the value divergence.
For example uniform i32 is SReg_32RegClass but the divergent one is VReg_32RegClass.
Unfortunately, TargetLowering::getRegClassFor function relies on the simple array indexed by the value types.
Hence we only have one register class for the concrete value type. To workaround this I had to override this method in the target.
I also add the boolean argument to designate the value divergence.

This review has passed precheckin.

However it is created as a starting point for the wider discussion to elaborate the best approach.

Diff Detail

Event Timeline

alex-t created this revision.Mar 29 2019, 6:40 AM
alex-t updated this revision to Diff 192828.Mar 29 2019, 8:15 AM
alex-t edited the summary of this revision. (Show Details)
rampitec added inline comments.Mar 29 2019, 4:50 PM
include/llvm/CodeGen/FunctionLoweringInfo.h
60

Please follow formatting rules with pointers, in all places.

include/llvm/CodeGen/TargetLowering.h
647

Indent.
Also please write a comment what is this function about.

include/llvm/CodeGen/TargetRegisterInfo.h
523

80 chars per line.
Also please write a comment what is this function about.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13921–13922

Can you run clang-format please?

lib/CodeGen/SelectionDAG/FunctionLoweringInfo.cpp
382

80 chars.

lib/CodeGen/SelectionDAG/InstrEmitter.cpp
229

Switch the condition order. Node->isDicergent() is less expensive.

404

Switch the condition order.

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
636

!isSGPRReg()

lib/Target/AMDGPU/SIISelLowering.cpp
10169

!isSGPRClass()

lib/Target/AMDGPU/SIRegisterInfo.h
200

!isSGPRClass()

alex-t updated this revision to Diff 193265.Apr 2 2019, 5:58 AM

changed according the reviewer request

alex-t marked 10 inline comments as done.Apr 2 2019, 6:00 AM
alex-t added a reviewer: efriedma.
rampitec added inline comments.Apr 2 2019, 10:55 AM
lib/Target/AMDGPU/SIFixSGPRCopies.cpp
635

I think you can decrease nesting here. At the very least join individual "if" conditions with "&&".

efriedma added inline comments.Apr 2 2019, 12:25 PM
include/llvm/CodeGen/TargetLowering.h
648

I'm confused what this means. A value is either divergent, or not divergent, and the premise of this patch is that it isn't appropriate to put a divergent value in a uniform register. But this patch forces the value into a uniform register anyway?

lib/CodeGen/SelectionDAG/InstrEmitter.cpp
592

Weird indentation.

lib/Target/AMDGPU/SIISelLowering.cpp
10207

It's probably not a good idea to traverse the use-list of anything that isn't an instruction here; you could end up finding uses in a different function.

I don't really understand what you're trying to do here; is this going to force an arbitrary tree of instructions into uniform registers?

alex-t marked 2 inline comments as done.Apr 3 2019, 4:34 AM
alex-t added inline comments.
include/llvm/CodeGen/TargetLowering.h
648

There is no direct one-to-one mapping between the divergence of the high level IR value and target specific register class.
So, each target need the specific hook for target specific selection of the register class for the given value.
And yes, I agree that the function name is misleading. I should rename it.
For instance: we have naturally divergent value that is lowered to 64 bits individual for each thread in the wave front.
Obviously we want it to live in the 64 bit scalar register.

lib/Target/AMDGPU/SIISelLowering.cpp
10207

Hmm... I was pretty sure that the UseList is per function... it really is the same for the whole module? It looks weird to me.

As for what I'm doing here: the whole program slice that produces/consumes 64 bit mask for the SI Control Flow staff should be forced to the SGPRs. It is NOT arbitrary set. I start from the value in question and DFS on the def-use tree until I meet SI_IF_BREAK or to the end.
If I meet SI_IF_BREAK whole the tree requires SGPRs for concrete uses.

Let's say we have some instruction that produce the value, SI_IF_BREAK that finally uses the derived value and some another instruction in between that is naturally VALU. If this VALU instruction description does not allow the SGPR on this exact operand position we'll have to insert a SGPR to VGPR COPY operation.

Once more it is target specific interpretation. Divergence of the high level value does not strictly require it to be assigned to VGPR.
It may be 64 bit mask.

efriedma added inline comments.Apr 3 2019, 11:55 AM
lib/Target/AMDGPU/SIISelLowering.cpp
10207

Yes, use-lists are global, in general. For Instructions and Arguments specifically, all the uses are required to be in the same function as the definition, though, so maybe you're fine here.

If I'm following correctly, amdgcn_if_break are specialized intrinsics which are generated just before isel, and there's a specific set of PHI nodes and intrinsics that need to remain in SGPRs? I'm not quite sure I follow why you can't solve the issue by making divergence analysis treat amdgcn_if_break as a non-divergent operation. But I'll let a reviewer more familiar with the target handle that.

Given the way the input is currently structured, the check is reasonable, I guess.

alex-t marked an inline comment as done.Apr 4 2019, 7:50 AM
alex-t added inline comments.
lib/Target/AMDGPU/SIISelLowering.cpp
10207

Why we cannot just add the exception to the DA for SI_IF_BREAK?

Typically all the CF intrinsics are connected by the 64bit mask that is defined by one (SI_IF for instance) and then used by another (SI_Else and SI_END_CF for instance). This mask always required to reside in 64 bit SGPR.
Same time the condition is usual boolean value and can be either divergent or not.
If we force the whole intrinsic to be uniform we would spoil the divergence propagation for those values that are control dependent on the divergent CF.

And that;s why the following code:

if ((Intrinsic->getIntrinsicID() == Intrinsic::amdgcn_if_break) &&
     (V == U->getOperand(1)))
   Result = true;

explicitly checks which operand is the use.

efriedma added inline comments.Apr 4 2019, 12:04 PM
lib/Target/AMDGPU/SIISelLowering.cpp
10207

Okay, that makes sense.

Adding llvm-commits to the CC. Please be more careful about that in the future... see http://llvm.org/docs/Phabricator.html

alex-t added a comment.Apr 5 2019, 1:41 AM

Adding llvm-commits to the CC. Please be more careful about that in the future... see http://llvm.org/docs/Phabricator.html

Okay, thanks a lot :)

From the point of view of the design of all these interface, It's too bad we can't fix this in post. From an overall standpoint, it's actually better to get the register classes from the beginning, so sure, let's go with this kind of approach.

After a bit of reflection, I think that in part what's happening here is that the uniform/divergent axis and the register bank axis is getting confused. See @efriedma's question and my comment on isDivergentRegClass. I wonder if some parts of this change would not be better expressed in terms of register banks. For example, in the uses of isDivergentRegClass, aren't we really looking for another register class in the same bank (SGPR vs. VGPR)? Similarly, maybe requiresUniformRegister can be rephrased as returning a required register bank (and nullptr by default)?

Why do we still need to move PHIs to VALU after this change? (Looking at SIFixSGPRCopies) Shouldn't the PHI be selected with the correct register class already? What's an example where this doesn't happen?

How are values handled which are uniform inside a loop but divergent for outside uses due to a divergent exit condition?

include/llvm/CodeGen/TargetRegisterInfo.h
524–526

This function is problematic because we can't actually tell for a given register class whether the underlying value is divergent or not. Specifically, 64-bit SGPRs can be either uniform or divergent depending on whether it's the lowering of an i1 or an i64.

lib/CodeGen/SelectionDAG/InstrEmitter.cpp
556

The formatting looks off here.

lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp
726 ↗(On Diff #193265)

This looks like it belongs into a separate patch.

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
630

const auto &

lib/Target/AMDGPU/SIISelLowering.cpp
10167–10168

Can we not keep uniform i1 in 32-bit registers? Maybe at least mark this as a TODO?

alex-t added a comment.EditedApr 8 2019, 8:10 AM

After a bit of reflection, I think that in part what's happening here is that the uniform/divergent axis and the register bank axis is getting confused. See @efriedma's question and my comment on isDivergentRegClass. I wonder if some parts of this change would not be better expressed in terms of register banks. For example, in the uses of isDivergentRegClass, aren't we really looking for another register class in the same bank (SGPR vs. VGPR)? Similarly, maybe requiresUniformRegister can be rephrased as returning a required register bank (and nullptr by default)?

Why do we still need to move PHIs to VALU after this change? (Looking at SIFixSGPRCopies) Shouldn't the PHI be selected with the correct register class already? What's an example where this doesn't happen?

Trying to answer all the above now... The reason for isDivergentRegClass hook and moving PHIs to VALU is the same.
Divergence driven ISel would work fine in case we really have SALU alternative for every VALU instruction. Unfortunately we have no.
This leads to insertion unnecessary moves and v_readfirstlane around the naturally uniform code.

Let's imagine we have a uniform loop that multiplies and adds floating point array elements.
We only have VALU floating point v_fmad that accepts VGPRs and produce VGPR. Yes all the values are uniform but they still need to be in VGPRs!
If we keep PHI in the loop header uniform we would end up moving SGPRs to VGPRs before v_fma and v_readfirstlane the resulting VGPR to just make it go round the loop to be moved back to VGPR.

So, we need some interface to query if the given register class is considered uniform or divergent in the given target.

Look how we use it:

const TargetRegisterClass *RC =
  TRI->getAllocatableClass(TII->getRegClass(II, i, TRI, *MF));
   
if (i < NumResults && TLI->isTypeLegal(Node->getSimpleValueType(i))) {
  const TargetRegisterClass *VTRC = TLI->getRegClassFor(
      Node->getSimpleValueType(i),
      Node->isDivergent() || (RC && TRI->isDivergentRegClass(RC)));

RC above - register class retrieved from the instruction description. That means we must have VGPR operand in this concrete position.
Thus, the condition in "getRegClassFor" is really "if SDNode is divergent itself or we have to assign VGPR because we only have VALU form of the instruction".

Same for the PHIs. By the point SIFixSGPRCopies works we already selected everything according the divergence.
If we have uniform PHI with VGPR input or the uniform PHIs user requires VGPR that means we just have no SALU form for the defining instruction or the user instruction.
In this case we have to convert PHI back to VALU.

I understand that my solution looks disgusting. And yes I'm thinking of further changing the getRegClassFor interface to incorporate all the target related hacks to the target specific code.
The main problem is that in several places in LLVM core code the getRegClassFor is called from the context that only have type or register class but has no Value.

alex-t marked an inline comment as done.Apr 8 2019, 8:49 AM
alex-t added inline comments.
include/llvm/CodeGen/TargetRegisterInfo.h
524–526

This is not about the underlying value at all.
This is a way to ask the target does it consider given register class as uniform or divergent.
In other words: we cannot expose the concrete register class properties to the common code.
From the other hand, the instruction description structure is common and it maps operand to register class.
While emitting the instruction we want to consult the target if the given operand required to be assigned the divergent (aka VGPR) register. This is not because of the value divergence but because the selected instruction.

Okay, you've convinced me. I only hope we can move forward with GlobalISel and do it right there.

There are still some formatting issues, but apart from that I think the patch is good.

include/llvm/CodeGen/TargetRegisterInfo.h
524–526

Another way to look at it is that my misunderstanding of the point of the function is precisely why the name is so misleading :)

alex-t updated this revision to Diff 199406.May 14 2019, 5:05 AM

Added fixes after extended testing. Also GFX10 related update.

LGTM apart from a bunch of formatting issues. I haven't marked all of them, please just run clang-format or clang-format-diff.

include/llvm/CodeGen/TargetLowering.h
646–647

Please run clang-format.

alex-t updated this revision to Diff 199583.May 15 2019, 5:13 AM

formatting etc

rampitec added inline comments.May 15 2019, 8:14 AM
include/llvm/CodeGen/FunctionLoweringInfo.h
246

Formatting.

248

Formatting.

lib/CodeGen/SelectionDAG/InstrEmitter.cpp
293–294

Formatting.

592

Formatting.

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
635

You can still decrease nesting.

lib/Target/AMDGPU/SIISelLowering.cpp
9645–9646

Formatting.

lib/Target/AMDGPU/SIISelLowering.h
370–373

Formatting.

lib/Target/ARM/ARMISelLowering.cpp
1432–1434

Formatting.

lib/Target/ARM/ARMISelLowering.h
459–460

Formatting.

alex-t updated this revision to Diff 200219.May 20 2019, 1:36 AM
alex-t marked 11 inline comments as done.

more formatting + new test updated

LGTM. Let's finish with internal integration and testing before proceeding.

This revision is now accepted and ready to land.May 23 2019, 11:39 AM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptMay 24 2019, 8:33 AM
pcc added a subscriber: pcc.May 24 2019, 6:51 PM

Hi Alexander, unfortunately I needed to revert this in rL361688 because it broke two sanitizer bots.

Hi there,

This change introduces a regression with RADV, all dEQP-VK.subgroups.arithmetic.framebuffer.* are failing now.
Can someone look into this?
Thanks!

Hi there,

This change introduces a regression with RADV, all dEQP-VK.subgroups.arithmetic.framebuffer.* are failing now.
Can someone look into this?
Thanks!

Yes. Sure.
I'll try to fix this in 3 days and revert if won't succeed.

alex-t added a comment.EditedMay 29 2019, 9:34 AM

Hi there,

This change introduces a regression with RADV, all dEQP-VK.subgroups.arithmetic.framebuffer.* are failing now.
Can someone look into this?
Thanks!

I have a patch that fixes the issue in another test suite. Could you please suggest how to check if it also fixes RADV?
https://reviews.llvm.org/D62614

Hi there,

This change introduces a regression with RADV, all dEQP-VK.subgroups.arithmetic.framebuffer.* are failing now.
Can someone look into this?
Thanks!

I have a patch that fixes the issue in another test suite. Could you please suggest how to check if it also fixes RADV?
https://reviews.llvm.org/D62614

This patch fixes the CTS failures on my side. I have just tried the latest version.

Hi there,

This change introduces a regression with RADV, all dEQP-VK.subgroups.arithmetic.framebuffer.* are failing now.
Can someone look into this?
Thanks!

I have a patch that fixes the issue in another test suite. Could you please suggest how to check if it also fixes RADV?
https://reviews.llvm.org/D62614

This patch fixes the CTS failures on my side. I have just tried the latest version.

Err, only a subset is fixed actually.

A number of Mesa piglit tests are also affected at least on Bonaire (but it seems to not be GPU-specific, I haven't had a chance to look at it further).

- bin/ext_transform_feedback-order elements triangles
- bin/ext_transform_feedback-order elements points
- bin/ext_transform_feedback-order elements lines
- bin/ext_transform_feedback-order arrays triangles>
- bin/ext_transform_feedback-order arrays points
- bin/ext_transform_feedback-order arrays lines
- arb_clear_buffer_object-formats (96-bit clears)

It's unclear whether the regression is caused by this particular commit or by the subsequent ASAN fix.

alex-t added a comment.Jun 4 2019, 5:07 AM

Hi there,

This change introduces a regression with RADV, all dEQP-VK.subgroups.arithmetic.framebuffer.* are failing now.
Can someone look into this?
Thanks!

I have a patch that fixes the issue in another test suite. Could you please suggest how to check if it also fixes RADV?
https://reviews.llvm.org/D62614

This patch fixes the CTS failures on my side. I have just tried the latest version.

Err, only a subset is fixed actually.

I have updated the change ttps://reviews.llvm.org/D62614 this Sunday.
The new one takes completely different approach. I'd appreciate very much If you could try it.

alex-t added a comment.Jun 4 2019, 5:11 AM

A number of Mesa piglit tests are also affected at least on Bonaire (but it seems to not be GPU-specific, I haven't had a chance to look at it further).

- bin/ext_transform_feedback-order elements triangles
- bin/ext_transform_feedback-order elements points
- bin/ext_transform_feedback-order elements lines
- bin/ext_transform_feedback-order arrays triangles>
- bin/ext_transform_feedback-order arrays points
- bin/ext_transform_feedback-order arrays lines
- arb_clear_buffer_object-formats (96-bit clears)

It's unclear whether the regression is caused by this particular commit or by the subsequent ASAN fix.

Could you please try the newest fix that prevent SGPR to VGPR copies sinking out of the loop?
If it does not help I will revert the change.

I have updated the change ttps://reviews.llvm.org/D62614 this Sunday.
The new one takes completely different approach. I'd appreciate very much If you could try it.

D62614 doesn't fix the issue.

alex-t added a comment.Jun 5 2019, 2:06 AM

I have updated the change ttps://reviews.llvm.org/D62614 this Sunday.
The new one takes completely different approach. I'd appreciate very much If you could try it.

D62614 doesn't fix the issue.

Okay. I'm about to start partial revert of the change.
Could you please provide me test cases so that I can check if my further fixes help.

I have updated the change ttps://reviews.llvm.org/D62614 this Sunday.
The new one takes completely different approach. I'd appreciate very much If you could try it.

D62614 doesn't fix the issue.

Okay. I'm about to start partial revert of the change.
Could you please provide me test cases so that I can check if my further fixes help.

See below the good and bad outputs for one CTS failure:

GOOD: https://hastebin.com/muwuwivofu
BAD: https://hastebin.com/gofawejoku

Thanks again for looking into this.

Note that this change also breaks https://bugs.freedesktop.org/show_bug.cgi?id=110811

I have updated the change ttps://reviews.llvm.org/D62614 this Sunday.
The new one takes completely different approach. I'd appreciate very much If you could try it.

D62614 doesn't fix the issue.

Okay. I'm about to start partial revert of the change.
Could you please provide me test cases so that I can check if my further fixes help.

See below the good and bad outputs for one CTS failure:

GOOD: https://hastebin.com/muwuwivofu
BAD: https://hastebin.com/gofawejoku

Thanks again for looking into this.

Note that this change also breaks https://bugs.freedesktop.org/show_bug.cgi?id=110811

I investigated the failed case. The reason is again in use of the value that is uniform inside the loop but the loop has divergent exit.
We rely on LCSSA PHIs to handle this. Unfortunately, Early CSE pass mistakenly removes them.
I have one line fix and the review for it: https://reviews.llvm.org/D63489
Could you possibly check if it helps in this particular case? If yes it maybe worth checking others...