This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Scheduler: fix RP calculation for a MBB with one successor
ClosedPublic

Authored by vpykhtin on Oct 28 2022, 12:02 AM.

Details

Summary

We reuse live registers after tracking one MBB as live-ins to the successor MBB
if the successor is only one but we don't check if the successor has other predecessors.

A B
\ /
C

A and B have one successor but C has live-ins defined by A and B and therefore should be
initialized using LIS.

This fixes 83 lit tests out if 420 with EXPENSIVE_CHECK enabled.

Diff Detail

Event Timeline

vpykhtin created this revision.Oct 28 2022, 12:02 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 28 2022, 12:02 AM
vpykhtin requested review of this revision.Oct 28 2022, 12:02 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 28 2022, 12:02 AM
vpykhtin edited the summary of this revision. (Show Details)Oct 28 2022, 12:05 AM
vpykhtin edited the summary of this revision. (Show Details)
foad added a comment.Oct 28 2022, 2:29 AM

I don't understand this. Why would the liveouts of A and B be different from the liveins of C?

vpykhtin added a comment.EditedOct 28 2022, 9:45 AM

I don't understand this. Why would the liveouts of A and B be different from the liveins of C?

You're right - live-outs of A and B are live-ins to C, but in this case we're trying to use live-outs of only A or only B as live-ins to C.

This code is an optimization for the case when a MBB has one successor (which in turn has the MBB as a single predecessor) - this allows to reuse liveouts as liveins and continue tracking without RP reset.

foad added a comment.Oct 28 2022, 10:15 AM

I don't understand this. Why would the liveouts of A and B be different from the liveins of C?

You're right - live-outs of A and B are live-ins to C, but in this case we're trying to use live-outs of only A or only B as live-ins to C.

No, I mean that the live-outs of A should be identical to the live-outs of B. And both should be identical to the live-ins of C. That's how liveness works.

vpykhtin abandoned this revision.Oct 28 2022, 10:20 AM

I don't understand this. Why would the liveouts of A and B be different from the liveins of C?

You're right - live-outs of A and B are live-ins to C, but in this case we're trying to use live-outs of only A or only B as live-ins to C.

No, I mean that the live-outs of A should be identical to the live-outs of B. And both should be identical to the live-ins of C. That's how liveness works.

Agree, this patch hides other bug.

vpykhtin added a comment.EditedOct 29 2022, 2:51 AM

There're some difficulties with LIS however, example (reduced):

0B bb.0.entry:    successors: %bb.4, %bb.1;
...
320B	  S_CBRANCH_VCCNZ %bb.4, implicit killed $vcc

336B bb.1:	; predecessors: %bb.0, successors: %bb.2;
352B	  ...
368B	  undef %47.sub2:vreg_96 = IMPLICIT_DEF

432B bb.2.Flow:	; predecessors: %bb.4, %bb.1,  successors: %bb.3, %bb.5;
512B	  ...
...
624B	  S_BRANCH %bb.3

640B bb.3.if:    	; predecessors: %bb.2, successors: %bb.5;
656B	  ...
672B	  ...
720B	  %47:vreg_96 = IMAGE_SAMPLE_V3_V2 %48:vreg_64, ...
832B	  S_BRANCH %bb.5

848B bb.4.else:       ; predecessors: %bb.0, successors: %bb.2;
864B	  ...
...
928B	  %47:vreg_96 = IMAGE_SAMPLE_V3_V2 %40:vreg_64...
...
1072B	  S_BRANCH %bb.2

1088B bb.5.endif:	; predecessors: %bb.2, %bb.3
1136B	  ...
1168B	  EXP_DONE 0, %47.sub0:vreg_96, %47.sub1:vreg_96, %47.sub2:vreg_96, %49:vgpr_32, -1, 0, 15, implicit $exec
1184B	  S_ENDPGM 0


Live intervals for %47

%47 [368r,432B:3)[432B,640B:1)[720r,848B:0)[928r,1088B:4)[1088B,1168r:2) 0@720r 1@432B-phi 2@1088B-phi 3@368r 4@928r  

L0000000000000030 [368r,432B:3)[432B,640B:1)[720r,848B:0)[928r,1088B:4)[1088B,1168r:2) 0@720r 1@432B-phi 2@1088B-phi 3@368r 4@928r  
L000000000000000C [432B,640B:1)[720r,848B:0)[928r,1088B:4)[1088B,1168r:2) 0@720r 1@432B-phi 2@1088B-phi 3@x 4@928r  
L0000000000000003 [432B,640B:1)[720r,848B:0)[928r,1088B:4)[1088B,1168r:2) 0@720r 1@432B-phi 2@1088B-phi 3@x 4@928r

bb.2.Flow has two predecessors bb.4 and bb.1:

  • at the end of bb.1 %47 has live mask 0x30 - L0000000000000030 [368r,432B:3) due to undef %47.sub2:vreg_96 = IMPLICIT_DEF
  • at the beginning of bb.2.Flow %47 has 0x3F lane mask due to 928B %47:vreg_96 = IMAGE_SAMPLE_V3_V2 ...

In this example we're tracking bb.1 first and going to continue in bb.2.Flow with the liveouts left from bb.1 but liveout for %47 has 0x30 mask and livein should have 0x3F.

I'm not sure if the undef definition with subreg should be treated as a whole reg def because LIS doesn't think so. However there is a different point of view:

llvm/lib/CodeGen/RegisterPressure.cpp: line 541

void collectOperandLanes(const MachineOperand &MO) const {
...
      // Treat read-undef subreg defs as definitions of the whole register.
      if (MO.isUndef())
        SubRegIdx = 0;

I'm not sure if the undef definition with subreg should be treated as a whole reg def because LIS doesn't think so. However there is a different point of view:

llvm/lib/CodeGen/RegisterPressure.cpp: line 541

void collectOperandLanes(const MachineOperand &MO) const {
...
      // Treat read-undef subreg defs as definitions of the whole register.
      if (MO.isUndef())
        SubRegIdx = 0;

The custom pressure tracking's treatment of sub registers always seemed broken to me. The register tuples do have a stronger constraint than the same number of individual registers. Pressure is always going to be a fuzzier metric than exact liveness

! In D136918#3894272, @arsenm wrote:
The custom pressure tracking's treatment of sub registers always seemed broken to me. The register tuples do have a stronger constraint than the same number of individual registers. Pressure is always going to be a fuzzier metric than exact liveness

When I look into this now I tend to agree. We should always account for the full superreg even if only one subreg is live because RenameIndependentSubregs should split it earlier into individual registers. That would simplify RP tracking a lot.

foad added a comment.Nov 1 2022, 5:32 AM

at the beginning of bb.2.Flow %47 has 0x3F lane mask due to 928B %47:vreg_96 = IMAGE_SAMPLE_V3_V2 ...

For this example. how much of %47 is really live at the beginning of bb.2.Flow? I.e. are there uses of all lanes of %47 in that bb or its successors? (Liveness is really "due to" uses, not defs.)

at the beginning of bb.2.Flow %47 has 0x3F lane mask due to 928B %47:vreg_96 = IMAGE_SAMPLE_V3_V2 ...

For this example. how much of %47 is really live at the beginning of bb.2.Flow? I.e. are there uses of all lanes of %47 in that bb or its successors?

Full %47 is used in bb.5.endif which is the successor of bb.2.Flow:

1088B bb.5.endif:	; predecessors: %bb.2, %bb.3
1136B	  ...
1168B	  EXP_DONE 0, %47.sub0:vreg_96, %47.sub1:vreg_96, %47.sub2:vreg_96, %49:vgpr_32, -1, 0, 15, implicit $exec

(Liveness is really "due to" uses, not defs.)

This sounds like if we need to track RP upward from uses to defs.

arsenm added a comment.Nov 2 2022, 9:52 AM

(Liveness is really "due to" uses, not defs.)

This sounds like if we need to track RP upward from uses to defs.

Yes, this is how things are supposed to work. There's a lot of infrastructure that needs to be inverted

(Liveness is really "due to" uses, not defs.)

This sounds like if we need to track RP upward from uses to defs.

Yes, this is how things are supposed to work. There's a lot of infrastructure that needs to be inverted

In particular RegScavenging is used in the forward direction and relies on accurate kill flags. stepBackwards doesn't, but most everything needs to move to use that instead

(Liveness is really "due to" uses, not defs.)

This sounds like if we need to track RP upward from uses to defs.

Yes, this is how things are supposed to work. There's a lot of infrastructure that needs to be inverted

In particular RegScavenging is used in the forward direction and relies on accurate kill flags. stepBackwards doesn't, but most everything needs to move to use that instead

We have GCNUpwardRPTracker which can be used at in AMDGPU code. It was written with the idea of making passes over tentative schedules, which aren't yet implemented on instructions yet.

vpykhtin reclaimed this revision.EditedMar 2 2023, 9:26 AM

I would like to reclaim this but with different reasoning:

As mentioned before there is a problem with liveness information which can be illustrated by:

0B bb.0.entry:    successors: %bb.4, %bb.1;
...
320B	  S_CBRANCH_VCCNZ %bb.4, implicit killed $vcc

336B bb.1:	; predecessors: %bb.0, successors: %bb.2;
352B	  ...
368B	  undef %47.sub2:vreg_96 = IMPLICIT_DEF

432B bb.2.Flow:	; predecessors: %bb.4, %bb.1,  successors: %bb.3, %bb.5;
512B	  ...
...
624B	  S_BRANCH %bb.3

640B bb.3.if:    	; predecessors: %bb.2, successors: %bb.5;
656B	  ...
672B	  ...
720B	  %47:vreg_96 = IMAGE_SAMPLE_V3_V2 %48:vreg_64, ...
832B	  S_BRANCH %bb.5

848B bb.4.else:       ; predecessors: %bb.0, successors: %bb.2;
864B	  ...
...
928B	  %47:vreg_96 = IMAGE_SAMPLE_V3_V2 %40:vreg_64...
...
1072B	  S_BRANCH %bb.2

1088B bb.5.endif:	; predecessors: %bb.2, %bb.3
1136B	  ...
1168B	  EXP_DONE 0, %47.sub0:vreg_96, %47.sub1:vreg_96, %47.sub2:vreg_96, %49:vgpr_32, -1, 0, 15, implicit $exec
1184B	  S_ENDPGM 0


Live intervals for %47

%47 [368r,432B:3)[432B,640B:1)[720r,848B:0)[928r,1088B:4)[1088B,1168r:2) 0@720r 1@432B-phi 2@1088B-phi 3@368r 4@928r  

L0000000000000030 [368r,432B:3)[432B,640B:1)[720r,848B:0)[928r,1088B:4)[1088B,1168r:2) 0@720r 1@432B-phi 2@1088B-phi 3@368r 4@928r  
L000000000000000C [432B,640B:1)[720r,848B:0)[928r,1088B:4)[1088B,1168r:2) 0@720r 1@432B-phi 2@1088B-phi 3@x 4@928r  
L0000000000000003 [432B,640B:1)[720r,848B:0)[928r,1088B:4)[1088B,1168r:2) 0@720r 1@432B-phi 2@1088B-phi 3@x 4@928r

bb.2.Flow has two predecessors bb.4 and bb.1:

  • at the end of bb.1 %47 has live mask 0x30 - L0000000000000030 [368r,432B:3) due to undef %47.sub2:vreg_96 = IMPLICIT_DEF
  • at the beginning of bb.2.Flow %47 has 0x3F lane mask due to 928B %47:vreg_96 = IMAGE_SAMPLE_V3_V2 ...

In this example we're tracking bb.1 first and going to continue in bb.2.Flow with the live-outs left from bb.1 but liveout for %47 has 0x30 mask and livein should have 0x3F.

This is a fundamental problem with LiveIntervals at the moment and cannot be fixed quickly.

The affected code in this patch is only intended to speedup live-ins computation but in fact it doesn't because live-ins for every block is precalculated anyway and in efficient way, see getBBLiveInMap at line 548.

This patch fixes assert in https://reviews.llvm.org/D136267

rampitec accepted this revision.Mar 7 2023, 9:59 AM

LGTM, but please add a comment why are you checking predecessors here. Something like: "Live-ins of a successor are the same as live-outs of a predecessor, but subreg mask may be different for different predecessors."

This revision is now accepted and ready to land.Mar 7 2023, 9:59 AM
vpykhtin updated this revision to Diff 503127.Mar 7 2023, 12:11 PM
  • rebased.
  • added comment.
rampitec accepted this revision.Mar 7 2023, 12:14 PM
This revision was landed with ongoing or failed builds.Mar 8 2023, 3:20 AM
This revision was automatically updated to reflect the committed changes.