This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Only raise wave priority if there is a long enough sequence of VALU instructions.
ClosedPublic

Authored by kosarev on Apr 29 2022, 6:14 AM.

Diff Detail

Event Timeline

kosarev created this revision.Apr 29 2022, 6:14 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 29 2022, 6:14 AM
kosarev requested review of this revision.Apr 29 2022, 6:14 AM
tsymalla added inline comments.Apr 29 2022, 7:30 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
135

You could opt out early (at the beginning) when VALUInstsThreshold == 0, is that correct?

arsenm added inline comments.Apr 29 2022, 10:58 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
30–31

Should we be counting cycles instead of instructions?

140

You're counting VALU instructions here and above?

198–199

Why not construct directly at the insert point?

kosarev updated this revision to Diff 426646.May 3 2022, 4:40 AM

Updated as suggested.

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
30–31

We don't know that yet, I'm afraid. For the couple of real use cases we have for the issue this pass tries to address, counting instructions looks sufficient. Given that proper cycle counting might be far from trivial, we may be at risk of over-engineering here.

135

We could, but I'm not sure I know how that might be useful in practice.

140

Yes, that's not good. Combined the two new loops into one and simplified related code. Thanks.

198–199

Well, the instruction we create here is not just an auxiliary value that we coincidentally happen to use in both cases, if that's what you mean. That instruction must be emitted in all cases, and we do want it to be the same instruction.

arsenm added inline comments.May 17 2022, 2:46 PM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
198–199

But you can construct it in the right place rather than constructing the instruction and then inserting it afterwards. You have the insert point you want, so you can just construct it there? It's unusual to need insert/insertAfter.

kosarev added inline comments.May 18 2022, 1:01 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
198–199

I'm not sure I see how this answers the point. Granted, constructing the instruction at the point of insertion is possible, but as I said, in this case we have reasons to create it in a single place.

foad added a comment.May 18 2022, 2:45 AM

I'm not sure how to review this. Can you explain why this heuristic makes sense intuitively? Or do you have any benchmarks to back it up? Or preferably both? :)

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
115

If you're ignoring loops, can you describe succinctly what you *are* counting? Is it something like the minimum number of VALU instructions along any path from the start of the function to the VMEM load in question?
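The quantity foad describes — the minimum number of VALU instructions executed along any path from the function entry to a given block — can be sketched as a forward pass over the block graph. The following is a standalone model, not the pass's actual code: Block, minVALUFromEntry, and the assumption that blocks are in reverse-post-order (so back-edges are simply ignored) are all hypothetical simplifications.

```cpp
#include <algorithm>
#include <climits>
#include <vector>

// Hypothetical simplified model: each block has a VALU instruction count and
// successor indices; compute, for each block, the minimum number of VALU
// instructions executed along any path from the entry block (block 0) to the
// start of that block.
struct Block {
  int NumVALU = 0;        // VALU instructions in this block
  std::vector<int> Succs; // successor block indices
};

// Blocks are assumed to be in reverse-post-order, so every predecessor of a
// block (ignoring loop back-edges) is visited before the block itself.
std::vector<int> minVALUFromEntry(const std::vector<Block> &Blocks) {
  std::vector<int> MinIn(Blocks.size(), INT_MAX);
  MinIn[0] = 0; // entry block: no VALU executed before it starts
  for (size_t I = 0; I < Blocks.size(); ++I) {
    if (MinIn[I] == INT_MAX)
      continue; // unreachable, or only reachable via an ignored back-edge
    int Out = MinIn[I] + Blocks[I].NumVALU;
    for (int S : Blocks[I].Succs)
      MinIn[S] = std::min(MinIn[S], Out);
  }
  return MinIn;
}
```

For a diamond-shaped CFG the result at the join block is the count along the cheaper of the two branches, which matches the "minimum along any path" reading.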

foad added inline comments.May 18 2022, 2:50 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
198–199

I agree with Matt that it would be more normal to construct-and-insert at the same time, though I don't feel very strongly about it. I think I suggested before that you could pass the insertion point into BuildSetprioMI. In this case the insertion point would be something like MBBInfos[MBB].LastVMEMLoad ? std::next(MBBInfos[MBB].LastVMEMLoad) : MBB->begin().

kosarev updated this revision to Diff 430382.May 18 2022, 7:54 AM

Reworked as suggested.

I'm not sure how to review this. Can you explain why this heuristic makes sense intuitively? Or do you have any benchmarks to back it up? Or preferably both? :)

Right, that's the hardest part of it, isn't it? Because we don't have many use cases to go on, the idea was to start by replicating more or less the same counting logic we have in the other compiler, and then adjust things based on feedback from people who can give it proper testing.

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
115

Good point. Amended the comment.

198–199

OK, changed.

I don't quite follow the cross-basic-block logic. Given what I understand of the goal of the heuristic, I expect it to be: "insert a s_setprio 0 before the first long section of dense VALU that can happen after a VMEM load (if any)".

Instead of MBBInfo::LastVMEMLoad I can see a MBBInfo::LastVALUSequenceBreak and an MBBInfo::PastVMEMLoad boolean to indicate whether a VMEM load can have been issued at the end of each basic block.

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
30–31

Could this also be configured via a function attribute? Those are less problematic than command-line options in a driver context.

155–156

I believe NumVALUInsts should be reset to 0 here. It should probably also be reset at a number of other events, in particular DS instructions.

The reasoning is that we want to lower priority just before running a long dense block of VALU, so that other waves have a better chance of running address calculation VALU.

kosarev added inline comments.Aug 8 2022, 4:44 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
155–156

So would something like this work for the purpose?

if (SIInstrInfo::isVALU(MI))
  ++NumVALUInsts;
else
  NumVALUInsts = 0;
kosarev updated this revision to Diff 452134.Aug 12 2022, 3:45 AM

Updated to count VALU instructions that follow VMEM loads and to add
support for threshold function attributes.

kosarev marked 2 inline comments as done.Aug 12 2022, 4:15 AM

I don't quite follow the cross-basic-block logic. Given what I understand of the goal of the heuristic, I expect it to be: "insert a s_setprio 0 before the first long section of dense VALU that can happen after a VMEM load (if any)".

Instead of MBBInfo::LastVMEMLoad I can see a MBBInfo::LastVALUSequenceBreak and an MBBInfo::PastVMEMLoad boolean to indicate whether a VMEM load can have been issued at the end of each basic block.

I think knowing whether we are at a point past a VMEM load wouldn't help, as there may be another VMEM load further down the control flow. There is a similar problem with the last sequence break: at the point where the sequence finally becomes 'long enough', there can be several points at which it begins.

Assuming I'm right that there is no harm in lowering the priority immediately after VMEM loads, it still seems easier implementation-wise to track the loads rather than other events. The updated version just makes sure we only consider VMEM loads that are followed by VALU sequences of the required length. The rest of the logic remains the same.
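The counting policy kosarev describes can be modeled as a standalone function (all names hypothetical — this is a simplification for illustration, not the pass's actual code): scan the block, reset the VALU counter on loads and on sequence breaks such as DS instructions, and record the most recent load once the counter reaches the threshold.

```cpp
#include <vector>

// Standalone model: instructions are classified as VMEM loads, VALU
// instructions, or sequence breaks (e.g. DS instructions). Return the index
// of the last VMEM load that is followed, later in the block, by an unbroken
// run of at least Threshold VALU instructions, or -1 if no load qualifies.
enum class Kind { VMEMLoad, VALU, SequenceBreak };

int lastQualifyingVMEMLoad(const std::vector<Kind> &Insts, int Threshold) {
  int LastLoad = -1;    // most recent VMEM load seen so far
  int Result = -1;      // last load already known to qualify
  int NumVALUInsts = 0; // length of the current unbroken VALU run
  for (int I = 0, E = (int)Insts.size(); I != E; ++I) {
    switch (Insts[I]) {
    case Kind::VMEMLoad:
      LastLoad = I;
      NumVALUInsts = 0;
      break;
    case Kind::VALU:
      if (++NumVALUInsts >= Threshold && LastLoad != -1)
        Result = LastLoad; // a long enough run follows this load
      break;
    case Kind::SequenceBreak:
      NumVALUInsts = 0; // DS and similar instructions break the sequence
      break;
    }
  }
  return Result;
}
```

Note how a load that is followed only by a short VALU run before the next load never becomes the result, which is the property being discussed here.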

llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
30–31

Done.

155–156

Updated to break the sequence on DS instructions.

kosarev updated this revision to Diff 452140.Aug 12 2022, 4:27 AM
kosarev marked an inline comment as done.

Update commit description.

kosarev edited the summary of this revision. (Show Details)Aug 12 2022, 4:27 AM
kosarev updated this revision to Diff 452164.Aug 12 2022, 5:59 AM

Update commit title as well.

kosarev retitled this revision from [AMDGPU] Do not raise wave priority beyond a specific number of VALU instructions. to [AMDGPU] Only raise wave priority if there is a long enough sequence of VALU instructions..Aug 12 2022, 5:59 AM
nhaehnle added inline comments.Aug 31 2022, 3:51 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
170–171

Why are you only counting if AtStart? The idea was: find places where a VMEM load is followed by a long sequence of dense VALU. Lower priority between that VMEM load and the dense VALU.

nhaehnle added inline comments.Sep 1 2022, 2:35 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
168–172

If the winner is MaxNumVALUInstsInMiddle, I believe the relevant VMEMLoad should be the VMEMLoad that appeared just before the corresponding sequence of VALU instead of the last one.

I think that instead of Info.LastVMEMLoad, we'd want to have Info.LastVMEMLoadBeforeLongVALU (feel free to think of a better name), which is only set here at the end if the threshold is exceeded. The loop above would keep track of the most recently seen VMEMLoad as well as the one corresponding to the longest VALU sequence in the middle so far.
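The alternative bookkeeping nhaehnle sketches here — remembering the VMEM load that precedes the longest VALU run, rather than the last load — could look like the following standalone model (hypothetical names, a simplification for illustration; as the discussion below notes, the patch as landed settled on the simpler "last load followed by a long enough run" policy instead).

```cpp
#include <vector>

// Standalone model: report the VMEM load preceding the longest unbroken VALU
// run in the block, but only if that run reaches the threshold; -1 otherwise.
enum class Inst { VMEMLoad, VALU, Other };

int loadBeforeLongestVALURun(const std::vector<Inst> &Insts, int Threshold) {
  int LastLoad = -1; // most recently seen VMEM load
  int RunLen = 0;    // current unbroken VALU run length
  int RunLoad = -1;  // load preceding the current run
  int BestLen = 0;   // longest run seen so far
  int BestLoad = -1; // load preceding that run
  for (int I = 0, E = (int)Insts.size(); I != E; ++I) {
    switch (Insts[I]) {
    case Inst::VALU:
      if (RunLen == 0)
        RunLoad = LastLoad; // a run starts here; remember the preceding load
      if (++RunLen > BestLen) {
        BestLen = RunLen;
        BestLoad = RunLoad;
      }
      break;
    case Inst::VMEMLoad:
      LastLoad = I;
      RunLen = 0; // a load breaks the current VALU run
      break;
    case Inst::Other:
      RunLen = 0;
      break;
    }
  }
  return BestLen >= Threshold ? BestLoad : -1;
}
```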

170–171

Thank you for the offline clarification, I understand this better now.

kosarev added inline comments.Sep 1 2022, 4:53 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
168–172

The ticket says the priority is supposed to be lowered after the last VMEM load, and this updated implementation was prepared on the assumption that counting VALUs doesn't affect that principle. So if this point stands, my understanding is that we should not be interested in any VALU sequences preceding VMEM loads, even long ones.

In terms of the code, since we reset MaxNumVALUInstsInMiddle every time we run into a VMEM load, I'm not sure I see how an in-the-middle sequence followed by a VMEM load can possibly be the winner.

Info.LastVMEMLoad storing the actual last VMEM load in the block should not be a problem, because it is the Info.MayReachVMEMLoad flag that the following analysis takes into account. And speaking of naming, I'm not perfectly happy with the name of that flag, as it actually means 'may reach any of the last VMEM loads that precede a long-enough sequence of VALU instructions'. I would appreciate it if anyone could suggest a better alternative of practical length.

nhaehnle added inline comments.Sep 1 2022, 4:59 AM
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp
168–172

I did think the policy was supposed to be "lower priority before the longest sequence of VALU instructions, if the sequence length crosses a certain threshold". But we can start with "lower after last VMEM load, if subsequent VALU sequence is long enough". So then, you're right and the code is okay as-is.

@nhaehnle Are we fine with accepting and submitting this? Or do we want to give this some testing beforehand? I'm also not quite sure whether the pass is supposed to be enabled by default.

nhaehnle accepted this revision.Sep 7 2022, 1:03 PM

Yes, I think it's fine.

This revision is now accepted and ready to land.Sep 7 2022, 1:03 PM
This revision was landed with ongoing or failed builds.Sep 8 2022, 7:22 AM
This revision was automatically updated to reflect the committed changes.