It is off by default, but can be used with --misched=si.
Diff Detail
Repository: rL LLVM
Event Timeline
I have given a few inline comments. This is a little hard to review, because I'm not really familiar with the scheduling infrastructure. My main concern is that the scheduler is too complex. Although maybe that's just the nature of schedulers.
It would be nice if the code could be organized such that there was a simple core scheduler, with several modular improvements that could be added or removed easily. Do you think this could be done?
The only requirement of the core scheduler should be that it produces consistent code, it doesn't necessarily have to be better than the current scheduler. One thing that is really bad about the current scheduler is how removing just one instruction will drastically change the structure of the program as well as the register usage. Having a consistent scheduler (even one that is worse) will make it easier to measure performance improvements of other optimizations.
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp | ||
---|---|---|
134–135 ↗ | (On Diff #31617) | Can we add a compile option for this? |
lib/Target/AMDGPU/SIMachineScheduler.cpp | ||
640 ↗ | (On Diff #31617) | This function seems to have parts that were copied from ScheduleDAGTopologicalSort::InitDAGTopologicalSorting() in ScheduleDAG.cpp. Is there some reason why we can't just re-use that function? |
753–758 ↗ | (On Diff #31617) | All of the loops like this in this function that iterate over the DAG should be moved into their own functions (i.e. one loop per function). This will make the code easier to understand and make it easier to enable/disable different passes if necessary. |
890–915 ↗ | (On Diff #31617) | This commented out section should be removed. |
917–944 ↗ | (On Diff #31617) | Can you move this loop into its own function. This will make it easier to understand what the scheduler is doing at a high level. |
955–977 ↗ | (On Diff #31617) | Can you move this loop to its own function too. |
985–1010 ↗ | (On Diff #31617) | Can you move this to its own function too. |
1014–1033 ↗ | (On Diff #31617) | This one too. |
lib/Target/AMDGPU/SIMachineScheduler.cpp | ||
---|---|---|
640 ↗ | (On Diff #31617) | The function couldn't be reused, mainly because it only does TopDown (here both TopDown and BottomUp were needed). |
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp | ||
---|---|---|
134–135 ↗ | (On Diff #31617) | The "SISchedRegistry" part adds the compile option "-misched=si" to use the scheduler. Uncommenting these two lines just removes the need for this option. |
lib/Target/AMDGPU/SIMachineScheduler.cpp | ||
753–758 ↗ | (On Diff #31617) | I think the whole code could be made more modular, with several ways of computing blocks and several ways to schedule them (and pick the best one). Given the plan to rewrite and make it more modular, should I still split these loops into separate functions? |
890–915 ↗ | (On Diff #31617) | I kept it for testing purposes. |
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp | ||
---|---|---|
134–135 ↗ | (On Diff #34135) | Then all the SI tests would need to be changed to use the converge scheduler (which is the current default one), else lots of test results would change. Since I'm not sure the scheduler is good for OpenCL tasks (perhaps the converge scheduler is better there), I'd rather not make it the default. |
lib/Target/AMDGPU/SIMachineScheduler.h | ||
173–174 ↗ | (On Diff #34135) | ok |
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp | ||
---|---|---|
134–135 ↗ | (On Diff #34135) | While the converge scheduler is still the default, if Mesa wants to use this scheduler, then it should explicitly ask for it using the command line options. |
lib/Target/AMDGPU/SIMachineScheduler.cpp | ||
---|---|---|
1382–1383 ↗ | (On Diff #34135) | If the debug printing is going to print something here, I would prefer that it always print yes/no rather than printing only when it happens |
1416–1418 ↗ | (On Diff #34135) | These can all be in one &&ed assert |
1556 ↗ | (On Diff #34135) | The normal naming convention is for this to just be TTI rather than SITTI |
1636–1637 ↗ | (On Diff #34135) | Range for, and put it on a separate line from the DEBUG() |
1682 ↗ | (On Diff #34135) | Capitalize / period |
lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
33–34 ↗ | (On Diff #34135) | Can you do something besides compare the name? If not, this should compare StringRefs or something rather than using the libc functions |
557 ↗ | (On Diff #34135) | llvm style omits the int here |
559–562 ↗ | (On Diff #34135) | This should be a ternary operator initialization |
570 ↗ | (On Diff #34135) | space around + |
577–582 ↗ | (On Diff #34135) | Ditto |
586 ↗ | (On Diff #34135) | Grammar: VGPR->VGPRs, SGPR->SGPRs |
588 ↗ | (On Diff #34135) | Space around * |
lib/Target/AMDGPU/SIInstrInfo.cpp | ||
---|---|---|
2732–2734 ↗ | (On Diff #34135) | This can return the condition without the if. This also isn't 100% accurate. This will miss barriers and a few other cases. Better would be to check the instruction latency maybe with a threshold value of some sort. |
lib/Target/AMDGPU/SIMachineScheduler.cpp | ||
128 ↗ | (On Diff #34135) | Why are these padded with spaces at the end? |
134 ↗ | (On Diff #34135) | You don't need a ; here |
174 ↗ | (On Diff #34135) | Looks like a stray comment |
200–201 ↗ | (On Diff #34135) | This sentence reads weirdly / seems to be missing some words How about: "Low latency instructions which do not depend on other low latency instructions we haven't waited for" |
202–203 ↗ | (On Diff #34135) | Ditto |
209 ↗ | (On Diff #34135) | "If never in the block there is a lot of constant loads" -> "If there are not many constant loads in the block" |
217 ↗ | (On Diff #34135) | This shouldn't be hardcoded; use a pass parameter or a command line option read during pass initialization |
245–246 ↗ | (On Diff #34135) | Did you mean for these to be copies? |
247 ↗ | (On Diff #34135) | Capitalize and period |
249 ↗ | (On Diff #34135) | Can you use this instead of repeating (*I)-> multiple times? |
259 ↗ | (On Diff #34135) | Dead code. Uncomment and maybe add more of a description if this is useful debug printing |
273–274 ↗ | (On Diff #34135) | This can be a range for |
291 ↗ | (On Diff #34135) | Function would be better named isDefBetween since it doesn't return the def. Something about this function doesn't seem right to me. I don't think you should be looking at the def_instructions, and instead checking the Segments of the LiveRange. |
295–297 ↗ | (On Diff #34135) | This can be a range loop over MRI->def_instructions() |
320–321 ↗ | (On Diff #34135) | This can be a range for |
371 ↗ | (On Diff #34135) | isVirtualRegister should be checked first |
400–401 ↗ | (On Diff #34135) | Range for |
419–420 ↗ | (On Diff #34135) | Can you factor this into a separate check predicate function to assert on |
423–424 ↗ | (On Diff #34135) | Range for |
426–427 ↗ | (On Diff #34135) | No space between assert and ( |
435–436 ↗ | (On Diff #34135) | Range for |
507–508 ↗ | (On Diff #34135) | Capitalize / period |
526–527 ↗ | (On Diff #34135) | Range for |
539–544 ↗ | (On Diff #34135) | This could be an std::find or factored into own function |
546–552 ↗ | (On Diff #34135) | Asserts should not be placed under DEBUG(). It would be better to put this into a separate function and assert on it |
558–563 ↗ | (On Diff #34135) | Similar std::find or separate function again |
591–592 ↗ | (On Diff #34135) | Range for |
597–598 ↗ | (On Diff #34135) | Single quotes around the single characters |
650–652 ↗ | (On Diff #34135) | No return after else |
710 ↗ | (On Diff #34135) | Space around - |
850 ↗ | (On Diff #34135) | .empty() |
870 ↗ | (On Diff #34135) | Having a set as the key to a map looks very strange. Is this really what you want? |
911–912 ↗ | (On Diff #34135) | Should use C++ style comments |
1040–1041 ↗ | (On Diff #34135) | This can be in a single DEBUG() statement |
1081 ↗ | (On Diff #34135) | Non English comment |
1123–1134 ↗ | (On Diff #34135) | This should be another separate function |
1151–1152 ↗ | (On Diff #34135) | Range for |
1242–1247 ↗ | (On Diff #34135) | No asserts under DEBUG() |
1355–1358 ↗ | (On Diff #34135) | Single quote ' ' and '\n' |
lib/Target/AMDGPU/SIMachineScheduler.h | ||
26–27 ↗ | (On Diff #34135) | This is wrapped weirdly. I would prefer one item per line |
34 ↗ | (On Diff #34135) | This looks like it should be an llvm::BitVector |
213–214 ↗ | (On Diff #34135) | Can you put a line between each function and the comment on the following declaration |
365–366 ↗ | (On Diff #34135) | These should return references |
lib/Target/AMDGPU/SIInstrInfo.cpp | ||
---|---|---|
2732–2734 ↗ | (On Diff #34135) | Instruction latency is not filled in for all instructions yet, so we cannot rely on it. I don't think we need to treat barriers as high latency, because (unless I'm wrong) |
lib/Target/AMDGPU/SIMachineScheduler.cpp | ||
128 ↗ | (On Diff #34135) | It looks like we don't need the extra spacing for now. I'll remove it in the next patch. |
174 ↗ | (On Diff #34135) | I just used it to signal that the SIScheduleBlock functions are below. I'm OK with removing it if you don't like it. |
291 ↗ | (On Diff #34135) | The function is an adaptation of RegisterPressure.cpp's findUseBetween function, which does the same but with uses. For defs, it looks like it can be done with LiveRange. Do you really want me to try using LiveRange instead? |
295–297 ↗ | (On Diff #34135) | I didn't know about range-based for loops before you advised me to use them. Is it possible to use one here, given there is a cast to MachineInstr*? |
517 ↗ | (On Diff #34135) | There was a mistake here, it shouldn't be SU->NodeNum but the Successor NodeNum. My next patch will fix it. |
lib/Target/AMDGPU/SIMachineScheduler.cpp | ||
---|---|---|
217 ↗ | (On Diff #34135) | I think this hardcoded value is suboptimal (as any hardcoded value would be), but I don't think it makes sense to ask the user to set it. I suggest keeping the hardcoded value for now and replacing the heuristic when we find a better one. |
lib/Target/AMDGPU/SIMachineScheduler.cpp | ||
---|---|---|
650–652 ↗ | (On Diff #34135) | Do you mean I should just remove the else? |
870 ↗ | (On Diff #34135) | Yes, I want to assign colors to different sets. This looked like the best way to do it. |
1556 ↗ | (On Diff #34135) | I use SITTI because TTI is already taken. SITTI is TTI cast to SIInstrInfo*. |
lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
33–34 ↗ | (On Diff #34135) | Unfortunately I didn't find any simpler way than comparing the names. Something like `if (StringRef("SGPR_32") == getRegPressureSetName(i))`? |
I've tried to understand what this scheduler does. As a general comment, I think it would be useful for the review if you decided upon a single variant for each of the steps of the scheduler. Having different variants makes things harder to follow.
I do like the idea of explicitly keeping track of VMEM instructions, since they have both the highest latency and the best latency-mitigation tool. I have focused on the blocks so far, and I have to say I'm not convinced that they are the right way to go. Some high-level thoughts on the topic:
- As a general approach that would fit better into the MachineSchedStrategy framework as I understand it: how about bottom-up scheduling in which, after you schedule the instruction consuming the result of a VMEM load, you keep track of how many other instructions (or better: cycles) you have already scheduled to cover the latency, and use that in decision making.
- There will be dependency chains between VMEMs (dependent texture loads in graphics, more general patterns in compute). For such dependency chains, it seems like a good idea to do an analysis prepass (as part of the scheduler) which collects information about stuff like how many DAG nodes are strictly between two dependent VMEM loads. This kind of information can be useful: consider a dependent texture load with a single dependency step in a fragment shader that looks like this: (interp) -> image -> (long calculation) -> image -> (short calculation) -> exp. The fact that one of these calculations is short and the other long can be obtained by a DAG traversal similar to your coloring, and it can be used to decide that the second image instruction should not be scheduled too early (in program order, or too late in bottom-up scheduling order).
- I've noticed while playing around that the blocks are of very different sizes. This alone made me a bit suspicious.
- The blocks are a form of very early decision making, and as a general rule, very early decision making tends to lead to bad results in optimization, unless it is done very well.
None of this is necessarily the final nail in the coffin for the blocks idea. I'd just like to see a better justification why scheduling at an instruction level is really insufficient. (I do think that some bigger-picture early analysis of the DAG to take latency into account is likely to be very useful.)
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp | ||
---|---|---|
25 ↗ | (On Diff #43214) | To keep recompile times lower, it would be good to get rid of this #include; AFAICT the only dependency is createSIMachineScheduler, which can be moved to SIMachineScheduler.cpp (and declared in AMDGPU.h). |
lib/Target/AMDGPU/SIMachineScheduler.cpp | ||
38 ↗ | (On Diff #43214) | What precisely do you mean by this last point? |
42–45 ↗ | (On Diff #43214) | This is hard to understand. Do you have some example of what you mean exactly, and an explanation of why it happens in the generic scheduler? |
128 ↗ | (On Diff #43214) | This function should be static. |
473 ↗ | (On Diff #43214) | Since releaseSucc() is only used here, the logic of whether to actually release or not should be moved out of it and into this loop. It is confusing to have a method releaseXxx which does not actually always release Xxx. The same actually holds for the releaseSuccessors method as a whole. This would be helped a lot if the parameter were an enum with descriptive names, such that the meaning of the parameter is clear at call sites. |
717–753 ↗ | (On Diff #43214) | This algorithm will give different colors to SUnits with different dependencies, but it does not always give the same color to SUnits with the same dependencies. Consider: High latency instructions A, B, C, followed by: Not sure whether this is critical, but it's something to keep in mind. |
757 ↗ | (On Diff #43214) | Grammar: same *as* before |
800 ↗ | (On Diff #43214) | Use std::pair instead of std::set here. |
827 ↗ | (On Diff #43214) | What is this method supposed to do? Which nodes are affected by it? |
942 ↗ | (On Diff #43214) | This method and the two above it are extremely similar. It should be possible to merge them somehow. |
lib/Target/AMDGPU/SIMachineScheduler.h | ||
190 ↗ | (On Diff #43214) | Style: first letter of method name should be lower case |
207 ↗ | (On Diff #43214) | Spelling: Grouped |
226–230 ↗ | (On Diff #43214) | I do not understand the semantics of reserved vs. non-reserved. Why the distinction? What does it mean? |
lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
669–684 ↗ | (On Diff #43214) | I believe this method is unused. |
lib/Target/AMDGPU/SIRegisterInfo.h | ||
28–30 ↗ | (On Diff #43214) | Perhaps a constant or enum could be used here instead of the magic 10. |
I agree with you that working with blocks is not optimal.
However, it is one way to make the problem simpler for the scheduler and achieve better results.
Ideally some analysis could be used, as you say, to make good decisions for the scheduling.
However, I couldn't find such an analysis pass yet, nor how it could be used.
Grouping the instructions together is not enforced by this scheduler. In fact, one can have a variant that does one instruction == one block;
however, so far it gives worse results, a sign that the grouping helps the second part of the scheduler make better decisions.
What gave me the idea of blocks is that when you look at the original TGSI, you can easily form sub-blocks of instructions in your mind, place the high latency instructions further away from their users, and come up with a schedule. I hoped this pass would rebuild such blocks in a form that is easy to manipulate.
One thing that is hard during scheduling is when a lot of different instructions are ready, all of which increase register usage.
Which instructions should you schedule now, in order to contain register usage and decrease it soon?
One solution would be to look at many possible schedule orders for the next k instructions and take the best one, but that's expensive.
The problem is very likely NP-hard.
Working with blocks is a way to make it simpler for the scheduler: a relatively good decision has already been made about what the k next instructions after each one are.
lib/Target/AMDGPU/SIMachineScheduler.cpp | ||
---|---|---|
38 ↗ | (On Diff #43214) | This was an attempt to explain that fewer registers used => latencies are better hidden (better wave counts) |
42–45 ↗ | (On Diff #43214) | Imagine you have 100 instructions of the form v_mov REG, CONSTANT, and one VMEM instruction loading two registers. If everything else depends on that VMEM, that results in a huge VGPR usage increase. |
473 ↗ | (On Diff #43214) | I don't get the second part of the comment? Should I replace the boolean with an enum? |
827 ↗ | (On Diff #43214) | This is for nodes like X1: for example, if we have A (a high latency instruction), X1 doesn't have any high latency instruction dependency in either direction. Without this function, all instructions in the X1 case are put in the same block. |
lib/Target/AMDGPU/SIMachineScheduler.h | ||
226–230 ↗ | (On Diff #43214) | Reserved: you can only add elements. If you have better names, don't hesitate... |
Thanks for taking care of those changes!
I'm still only lukewarm on the scheduler, because the blocks business seems a bit magical, and it would be nice to have a proper explanation of why this is better and (more importantly) why the same cannot be achieved by improving the existing scheduler with less of a "change the whole world" approach.
That said, the evidence is clearly that there are graphics performance improvements to be had with this scheduler, which I don't think we should keep from people. Furthermore, the change is low risk because it's nice and modular.
So all in all, I believe this should be accepted, and I will commit it soon.