This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Adjust wave priority based on VMEM instructions to avoid duty-cycling.
ClosedPublic

Authored by kosarev on Apr 22 2022, 4:22 AM.

Details

Summary

As older waves execute long sequences of VALU instructions, this may
prevent younger waves from address calculation and then issuing their
VMEM loads, which in turn leads the VALU unit to idle. This patch tries
to prevent this by temporarily raising the wave's priority.

A few notes and questions:

  • This intentionally avoids introducing any counting of VALU instructions as at this moment it's not entirely clear how we would want to do that in presence of branches and loops. However, the patch aims to make adding support for such counting as simple as possible by separating identification of VMEM loads we want to take into account from the rest of the logic.
  • The implementation assumes that one s_setprio 0 followed by another one is acceptable when edge splitting would be the only way to avoid this.
  • I understand we want this be enabled by default, but left the pass disabled for now until it's in production state -- for the sake of not touching tests that should not normally be affected by the pass.
  • If the s_setprio instruction is not universally available, I guess we may want to make sure the selected target does actually support it?
  • What do we do if s_setprios already present in the input code? Would doing nothing be acceptable and reasonable in such a case?

Diff Detail

Event Timeline

kosarev created this revision.Apr 22 2022, 4:22 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 22 2022, 4:22 AM
kosarev requested review of this revision.Apr 22 2022, 4:22 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 22 2022, 4:22 AM
kosarev updated this revision to Diff 424435.Apr 22 2022, 4:27 AM

Fix Lint complaints.

kosarev edited the summary of this revision. (Show Details)Apr 22 2022, 4:56 AM
kosarev added reviewers: foad, dstuttard, tpr.
foad added inline comments.Apr 22 2022, 4:57 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
32

It's a bit more normal to put "Enable" options in AMDGPUTargetMachine, where you can use isPassEnabled.

123

Should bail out early if skipFunction(MF.getFunction()), to help with bisecting optimization problems.

foad added inline comments.Apr 22 2022, 5:12 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
43

This seems like overkill. Can you just use a DenseMap<const MachineBasicBlock *, bool> directly? Or maybe just a DenseSet<const MachineBasicBlock *>?

98

I think it would be neater to have this function take a MachineBasicBlock::iterator instead of MF, and have it automatically insert the created instruction (by calling one of the BuildMI overloads that does the insertion).

foad added inline comments.Apr 22 2022, 5:41 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
103–104

I think this deserves a comment along the lines of: Check that for every predecessor Pred that can reach a VMEM load, none of Pred's successors can reach a VMEM load.

108–109

I don't understand what this is for. If Pred can reach a VMEM load and it is in a loop, then it must surely have at least one successor that can also reach a VMEM load (namely a successor that is on a path around the loop that leads back to Pred), so we will return false anyway on line 111 below.

137

Could use:

if (any_of(MBB, [&](const MachineInstr &MI){ return isVMEMLoad(MI); }))
  Worklist.push_back(&MBB);

instead of an explicit loop, here and elsewhere - although it is a matter of taste.

tsymalla added inline comments.Apr 22 2022, 5:58 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
32

Probably invert that options so that setting it to true enables the pass

131

Can be moved to the beginning of the function.

160

Could you comment that section please?

164

Typo: priority

192

Could be using the reverse iterator

jpages added a subscriber: jpages.Apr 22 2022, 6:54 AM
arsenm added inline comments.Apr 22 2022, 7:10 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
137

Is it really worthwhile to do this for all blocks? Should there be some kind of memory instruction count threshold?

192

Probably want llvm::make_early_inc_range(llvm::reverse(*MBB))

llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
386 ↗(On Diff #424435)

Does it really need to run right here? Can you move it earlier to share the loop analysis?

kosarev updated this revision to Diff 424500.Apr 22 2022, 9:12 AM

Updated.

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
32

Changed to use isPassEnabled(), thanks.

43

On the other hand, if we know we will need more fields there to count VALUs, then it's probably not worth to rewrite it back and forth? (I don't mind either way.)

98

The inserting BuildMI() seem take both the MF/MBB and the iterator. Then, the aim was this function to create the instruction and leave the client to choose the way it wants to insert it, so we don't need to rewrite the function if we need to insertAfter() or something (which was already the case with some of the early versions of the patch).

103–104

Done.

108–109

Indeed. Removed. Thanks.

123

Done.

131

Moved out to AMDGPUTargetMachine.cpp.

137

Sure, done.

160

Done.

164

Fixed, thanks.

192

Yeah, I tried that, but MBB->rend()->getIterator() seems to unable to return a valid iterator and crashes. I'm also not quite sure that would simplify the logic.

foad added inline comments.Apr 22 2022, 9:21 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
43

if we know we will need more fields there to count VALUs

I don't know that :)

In any case, why do you use a map of pointers to MBBInfo, instead of just a map of MBBInfo? That seems wasteful.

kosarev added inline comments.Apr 22 2022, 9:35 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
137

I understand there is going to be some counting for VALU instructions, but not aware of any plans to have thresholds for VMEMs. I guess others may know better.

llvm/test/CodeGen/AMDGPU/llc-pipeline.ll
386 ↗(On Diff #424435)

The Jay's finding eliminates the need for the loop analysis altogether, but speaking in general because this pass doesn't cause any major changes, we likely want it as late as possible.

kosarev added inline comments.Apr 22 2022, 10:11 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
43

I think the main desire was to have references to MBBInfo remain valid on insertions but also avoid the risk of DenseMap be dealing with what may potentially become values of larger size. Will give it another thought on Monday. : )

kosarev added inline comments.Apr 25 2022, 3:43 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
43

So if I'm not wrong that DenseMap<> may reallocate values on insertion, that means some rather innocently looking things like MBBInfos[A].NumSth + MBBInfos[B].NumSth being outlawed, and they are probably not the easiest things to catch during preparing changes or review.

If we see these few extra lines preventing the risk a complication beyond immediate needs, then I don't mind removing them, but do we really think it's worth it?

foad added inline comments.Apr 25 2022, 4:04 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
43

I would prefer not to have the extra complication. Other reviewers might disagree.

foad added inline comments.Apr 25 2022, 4:09 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
43

Actually I don't understand your argument. If MBBInfos[A].NumSth + MBBInfos[B].NumSth is not safe, then how does adding a level of indirection and writing MBBInfos[A]->NumSth + MBBInfos[B]->NumSth make it any safer?

kosarev added inline comments.Apr 25 2022, 4:35 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
43

Not sure I understand how levels of indirection are relevant, but if we still compare using the MBBInfoSet::operator[] above with using DenseMap::operator[] directly, then I think what makes the first safer is that it doesn't return references to reallocatable storage. Meaning in MBBInfos[A].NumSth + MBBInfos[B].NumSth we don't depend on the order of calling the MBBInfoSet::operator[]s and the structure dereferences.

foad added inline comments.Apr 25 2022, 4:54 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
42

If you keep this code, variables names should use CamelCase like Info.

43

OK, I understand now. Sorry I had missed (or forgotten) that you had written your own operator[].

Personally I still prefer not to have the extra complication.

kosarev updated this revision to Diff 425153.Apr 26 2022, 2:00 AM

Updated to map blocks directly to info structures.

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
43

OK, removed.

foad added inline comments.Apr 26 2022, 2:35 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
96

I think it only makes sense to run this pass on kernels (i.e. the entry point of a wave), not on subfunctions, so you should bail out if !AMDGPU::isKernel(MF.getFunction().getCallingConv()) or similar.

132

I could be wrong but I don't think kernels have a normal "return" at the end. Maybe change this to MBB.succ_empty() to make it a bit more generic?

foad added a comment.Apr 26 2022, 2:38 AM

The implementation assumes that one s_setprio 0 followed by another one is acceptable when edge splitting would be the only way to avoid this.

Sounds good to me.

If the s_setprio instruction is not universally available, I guess we may want to make sure the selected target does actually support it?

It is universally available.

foad added inline comments.Apr 26 2022, 2:44 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
137

A very minor point: you only lower priority in any predecessor if you can do it in all predecessors. Would it make sense to lower priority in some predecessors even if you can't do it in all of them?

kosarev updated this revision to Diff 425177.Apr 26 2022, 4:23 AM

Updated as suggested.

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
96

Would AMDGPU::isEntryFunctionCC() be the right function for that?

132

Changed. For amdgpu_ps shaders in the test I do see return MIs, though they seem to be emitted just as ; return to shader part epilog comments.

137

Yeah, good question. My analysis here is that while we know that would cost us extra time executing the s_setprio 0 in MBB when control comes from one of such predecessors, I presently do not have any evidence at hand that that would help cases considered critical or at least important. The thinking therefore was that strategically we might prefer let the problem, if there is any, reveal itself and via that get a concrete reproducer, rather than to speculate early and never know for sure if it's actually necessary.

I probably should clarify that I see this patch as an attempt to introduce a reliable ground for developing our understanding of the nature of the original issue (which I guess may take some time testing and collecting relevant use cases), and not an implementation that tries to do the best thing for all theoretically possible cases right away.

foad accepted this revision.Apr 26 2022, 5:14 AM

LGTM, thanks!

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
96

Oh yes, isEntry* looks more appropriate than isKernel*.

137

OK!

This revision is now accepted and ready to land.Apr 26 2022, 5:14 AM