Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
Event Timeline
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp | ||
---|---|---|
127 | You could early opt-out when VALUInstsThreshold == 0 (at the beginning), is that correct? |
Updated as suggested.
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp | ||
---|---|---|
30–31 | We don't know it yet, I'm afraid. For the couple real use cases that we have for the issue that this pass tries to address counting instructions looks sufficient. This being coupled with that proper cycle counting might be not very trivial, we may be at the risk of over-engineering here. | |
127 | We could, but I'm not sure I know how that might be useful in practice. | |
132 | Yes, that's not good. Combined the two new loops into one and simplified related code. Thanks. | |
205–206 | Well, the instruction we create here is not just an auxiliary value that we coincidentally happen to use in both the cases, if that's what you mean. That instruction must be spent in all cases and we do want it be the same instruction. |
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp | ||
---|---|---|
205–206 | But you can construct it in the right place rather than constructing the instruction and then inserting after. You have the insert point you want, you can just construct it there? It's unusual to need insert/insertAfter |
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp | ||
---|---|---|
205–206 | I'm not sure I see how this answers the point. Granted that constructing the instruction at the points of insertion is possible, but as I said in this case we have reasons to create it in a single place. |
I'm not sure how to review this. Can you explain why this heuristic makes sense intuitively? Or do you have any benchmarks to back it up? Or preferably both? :)
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp | ||
---|---|---|
126 | If you're ignoring loops, can you described succinctly what you *are* counting? Is it something like the minimum number of VALU instructions along any path from the start of the function to the VMEM load in question? |
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp | ||
---|---|---|
205–206 | I agree with Matt that it would be more normal to construct-and-insert at the same time, though I don't feel very strongly about it. I think I suggested before that you could pass the insertion point into BuildSetprioMI. In this case the insertion point would be something like MBBInfos[MBB].LastVMEMLoad ? std::next(MBBInfos[MBB].LastVMEMLoad) : MBB->begin(). |
Right, that's the hardest part of it, isn't it. Because we don't have much use cases provided, the idea was to start with replicating more or less the same counting logic we have in the other compiler, and then adjust things using feedback from people who can give it some proper testing.
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp | ||
---|---|---|
126 | Good point. Amended the comment. | |
205–206 | OK, changed. |
I don't quite follow the cross-basic-block logic. Given what I understand of the goal of the heuristic, I expect it to be: "insert a s_setprio 0 before the first long section of dense VALU that can happen after a VMEM load (if any)".
Instead of MBBInfo::LastVMEMLoad I can see a MBBInfo::LastVALUSequenceBreak and an MBBInfo::PastVMEMLoad boolean to indicate whether a VMEM load can have been issued at the end of each basic block.
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp | ||
---|---|---|
30–31 | Could this also be configured via a function attribute? Those are less problematic than command-line options in a driver context. | |
134–135 | I believe NumVALUInsts should be reset to 0 here. It should probably also be reset at a number of other events, in particular DS instructions. The reasoning is that we want to lower priority just before running a long dense block of VALU, so that other waves have a better chance of running address calculation VALU. |
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp | ||
---|---|---|
134–135 | So would something like this work for the purpose? if (SIInstrInfo::isVALU(MI)) ++NumVALUInsts; else NumVALUInsts = 0; |
Updated to count VALU instructions that follow VMEM loads and to add
support for threshold function attributes.
I think knowing if we are at a point past a VMEM load wouldn't help as there may be another VMEM load down the control flow. And a similar problem with the last sequence break as at where the sequence finally becomes 'long enough', we can have several points at which it begins.
If I'm not wrong that there is no harm in lowering the priority immediately after VMEM loads, it seems it's still easier implementation-wise to track them and not other things. The updated version just makes sure we only consider VMEM loads that are followed by VALU sequences of the required length. Then the rest of the logic remains the same.
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp | ||
---|---|---|
30–31 | Done. | |
134–135 | Updated to break the sequence on DS instructions. |
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp | ||
---|---|---|
142–143 | Why are you only counting if AtStart? The idea was: find places where a VMEM load is followed by a long sequence of dense VALU. Lower priority between that VMEM load and the dense VALU. |
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp | ||
---|---|---|
142–143 | Thank you for the offline clarification, I understand this better now. | |
159–163 | If the winner is MaxNumVALUInstsInMiddle, I believe the relevant VMEMLoad should be the VMEMLoad that appeared just before the corresponding sequence of VALU instead of the last one. I think that instead of Info.LastVMEMLoad, we'd want to have Info.LastVMEMLoadBeforeLongVALU (feel free to think of a better name), which is only set here at the end if the threshold is exceeded. The loop above would keep track of the most recently seen VMEMLoad as well as the one corresponding to the longest VALU sequence in the middle so far. |
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp | ||
---|---|---|
159–163 | The ticket says the priority is supposed to be lowered after the last VMEM load, and this updated implementation has been prepared in assumption that counting VALUs doesn't affect that principle. So if this point stands, then I understand we should not be interested in any VALU sequences preceding VMEM loads, even if long enough. In terms of the code, as we reset MaxNumVALUInstsInMiddle every time we ran into a VMEM load, I'm not sure I see how an in-the-middle sequence followed by a VMEM load can possibly be the winner. Info.LastVMEMLoad storing the actual last VMEM load in the block should not be a problem because it is the Info.MayReachVMEMLoad flag that the following analysis takes into account. And speaking of naming, I'm not perfectly happy with the the name of that flag as it actually means 'may reach any of the last VMEM loads that precede a long-enough sequence of VALU instructions'. Would appreciate if anyone can suggest a better alternative of a practical length. |
llvm/lib/Target/AMDGPU/AMDGPUSetWavePriority.cpp | ||
---|---|---|
159–163 | I did think the policy was supposed to be "lower priority before the longest sequence of VALU instructions, if the sequence length crosses a certain threshold". But we can start with "lower after last VMEM load, if subsequent VALU sequence is long enough". So then, you're right and the code is okay as-is. |
@nhaehnle Are we fine with accepting and submitting this? Or, do we want to give this some testing beforehand? Also not quite sure if the pass is supposed to be enabled to by default.
Should we be counting cycles instead of instructions?