This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU/SI: Add SI Machine Scheduler
ClosedPublic

Authored by axeldavy on Aug 9 2015, 6:01 AM.

Diff Detail

Repository
rL LLVM

Event Timeline

axeldavy updated this revision to Diff 31617.Aug 9 2015, 6:01 AM
axeldavy retitled this revision from to AMDGPU/SI: Add SI Machine Scheduler.
axeldavy updated this object.
axeldavy added a subscriber: llvm-commits.

I have given a few inline comments. This is a little hard to review, because I'm not really familiar with the scheduling infrastructure. My main concern is that the scheduler is too complex. Although maybe that's just the nature of schedulers.

It would be nice if the code could be organized such that there was a simple core scheduler, with several modular improvements that could be added or removed easily. Do you think this could be done?

The only requirement of the core scheduler should be that it produces consistent code, it doesn't necessarily have to be better than the current scheduler. One thing that is really bad about the current scheduler is how removing just one instruction will drastically change the structure of the program as well as the register usage. Having a consistent scheduler (even one that is worse) will make it easier to measure performance improvements of other optimizations.

lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
134–135 ↗(On Diff #31617)

Can we add a compile option for this?

lib/Target/AMDGPU/SIMachineScheduler.cpp
640 ↗(On Diff #31617)

This function seems to have parts that were copied from ScheduleDAGTopologicalSort::InitDAGTopologicalSorting() in ScheduleDAG.cpp. Is there some reason why we can't just re-use that function?

753–758 ↗(On Diff #31617)

All of the loops like this one in this function that iterate over the DAG should be moved into their own functions (i.e., one loop per function). This will make the code easier to understand and make it easier to enable/disable different passes if necessary.

890–915 ↗(On Diff #31617)

This commented out section should be removed.

917–944 ↗(On Diff #31617)

Can you move this loop into its own function? This will make it easier to understand what the scheduler is doing at a high level.

955–977 ↗(On Diff #31617)

Can you move this loop to its own function too?

985–1010 ↗(On Diff #31617)

Can you move this to its own function too?

1014–1033 ↗(On Diff #31617)

This one too.

axeldavy added inline comments.Aug 21 2015, 2:14 PM
lib/Target/AMDGPU/SIMachineScheduler.cpp
640 ↗(On Diff #31617)

The function couldn't be reused, mainly because it only does TopDown (here both TopDown and BottomUp are needed).

axeldavy added inline comments.Aug 27 2015, 10:22 AM
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
134–135 ↗(On Diff #31617)

The "SISchedRegistry" part adds the command-line option "-misched=si" to use the scheduler.

Uncommenting these two lines just removes the need for this option.

lib/Target/AMDGPU/SIMachineScheduler.cpp
753–758 ↗(On Diff #31617)

I think the whole code could be made more modular, have several ways of computing blocks, and several ways to schedule them (and pick the best one).
I plan to do the rewrite.
After a discussion, we decided the code could be merged before the rewrite.

Given the plan to rewrite and make things more modular, should I still split these loops into several functions?
I think having the passes far away from each other in separate functions will make it more complicated.

890–915 ↗(On Diff #31617)

I kept it for testing purposes.
On some games it gives a 10% increase in performance, but in general it causes a slight decrease.

axeldavy updated this revision to Diff 34135.Sep 7 2015, 3:09 AM
axeldavy edited edge metadata.

I introduced new classes to make things more modular and more readable.

lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
134–135 ↗(On Diff #34135)

Ok. We'll need to remove these commented lines before the patch is committed.

lib/Target/AMDGPU/SIMachineScheduler.h
173–174 ↗(On Diff #34135)

Coding style: }; should be on a new line.

axeldavy marked an inline comment as done.Sep 10 2015, 8:42 AM
axeldavy added inline comments.
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
134–135 ↗(On Diff #34135)

Then all the SI tests should be changed to use the converge scheduler (which is the current default one), or else lots
of them will fail (because the instruction order differs).

Since I'm not sure the scheduler is good for OpenCL tasks (perhaps the converge scheduler is better there),
I was thinking it was better to have Mesa explicitly ask for the scheduler, but another option is to have OpenCL programs ask for the converge scheduler (or whatever else) if this scheduler is not good for them.

lib/Target/AMDGPU/SIMachineScheduler.h
173–174 ↗(On Diff #34135)

ok

lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
134–135 ↗(On Diff #34135)

While the converge scheduler is still the default, if Mesa wants to use this scheduler, then it should explicitly ask for it using the command line options.

arsenm added inline comments.Sep 14 2015, 2:37 PM
lib/Target/AMDGPU/SIMachineScheduler.cpp
1382–1383 ↗(On Diff #34135)

If debug printing is going to print something here, I would prefer that it always prints yes/no rather than printing only when it happens.
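
A minimal sketch of the suggested style (the DEBUG_TYPE string and variable names are placeholders, not from the patch):

#define DEBUG_TYPE "misched"
#include "llvm/Support/Debug.h"
#include "llvm/Support/raw_ostream.h"

// Always print the decision, not only when it happens.
DEBUG(dbgs() << "Block " << BlockID << " is high latency: "
             << (IsHighLatency ? "yes" : "no") << '\n');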

1416–1418 ↗(On Diff #34135)

These can all be in one &&ed assert
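
For instance, a single combined assert along these lines (the conditions and message are placeholders):

#include <cassert>

// One &&-ed assert instead of three consecutive ones; the message shows up
// in the failure output.
assert(CondA && CondB && CondC && "block scheduling invariant violated");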

1556 ↗(On Diff #34135)

The normal naming convention is for this to just be TTI rather than SITTI

1636–1637 ↗(On Diff #34135)

range for and put on a separate line from the DEBUG()

1682 ↗(On Diff #34135)

Capitalize / period

lib/Target/AMDGPU/SIRegisterInfo.cpp
33–34 ↗(On Diff #34135)

Can you do something besides compare the name? If not, this should compare StringRefs or something rather than using the libc functions
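
A sketch of the StringRef variant, assuming the getRegPressureSetName(i) call already present in the patch:

#include "llvm/ADT/StringRef.h"

// Compare the pressure set name via StringRef instead of strcmp/strncmp.
if (llvm::StringRef(getRegPressureSetName(i)) == "SGPR_32") {
  // ... handle the SGPR_32 pressure set ...
}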

557 ↗(On Diff #34135)

LLVM style omits the int here.

559–562 ↗(On Diff #34135)

This should be a ternary operator initialization
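
Roughly (the variable names are placeholders):

// Ternary initialization instead of declaring the variable and then
// assigning it in an if/else.
unsigned Limit = IsSGPRSet ? SGPRLimit : VGPRLimit;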

570 ↗(On Diff #34135)

space around +

577–582 ↗(On Diff #34135)

Ditto

586 ↗(On Diff #34135)

Grammar: VGPR->VGPRs, SGPR->SGPRs

588 ↗(On Diff #34135)

Space around *

arsenm added inline comments.Sep 14 2015, 2:37 PM
lib/Target/AMDGPU/SIInstrInfo.cpp
2732–2734 ↗(On Diff #34135)

This can return the condition without the if.

This also isn't 100% accurate. It will miss barriers and a few other cases. Better would be to check the instruction latency, maybe with a threshold value of some sort.
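
A sketch that returns the condition directly and bases it on latency, assuming the TargetSchedModel API; the threshold is a made-up value, not from the patch:

#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/TargetSchedModel.h"
using namespace llvm;

// Derive "high latency" from the scheduling model rather than from an
// opcode check, with a tunable cycle threshold.
static bool isHighLatency(const MachineInstr *MI,
                          const TargetSchedModel &SchedModel) {
  const unsigned HighLatencyThreshold = 80; // placeholder threshold
  return SchedModel.computeInstrLatency(MI) > HighLatencyThreshold;
}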

lib/Target/AMDGPU/SIMachineScheduler.cpp
128 ↗(On Diff #34135)

Why are these padded with spaces at the end?

134 ↗(On Diff #34135)

You don't need a ; here

174 ↗(On Diff #34135)

Looks like a stray comment

200–201 ↗(On Diff #34135)

This sentence reads weirdly / seems to be missing some words

How about:

"Low latency instructions which do not depend on other low latency instructions we haven't waited for"

202–203 ↗(On Diff #34135)

Ditto

209 ↗(On Diff #34135)

"If never in the block there is a lot of constant loads" -> "If there are not many constant loads in the block"

217 ↗(On Diff #34135)

This shouldn't be hardcoded. A pass parameter or a command-line option read during pass initialization would be better.

245–246 ↗(On Diff #34135)

Did you mean for these to be copies?

247 ↗(On Diff #34135)

Capitalize and period

249 ↗(On Diff #34135)

Can you use this instead of repeating (*I)-> multiple times?

259 ↗(On Diff #34135)

Dead code. Uncomment and maybe add more of a description if this is useful debug printing

273–274 ↗(On Diff #34135)

This can be a range for

291 ↗(On Diff #34135)

Function would be better named isDefBetween since it doesn't return the def.

Something about this function doesn't seem right to me. I don't think you should be looking at the def_instructions; instead, check the Segments of the LiveRange.

295–297 ↗(On Diff #34135)

This can be a range loop over MRI->def_instructions()
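
A sketch of the suggested loop; MRI->def_instructions() is the MachineRegisterInfo range API, and Reg stands for whatever register the surrounding code inspects:

// The loop variable is already a MachineInstr&, so the explicit cast from
// the def_instr iterator goes away.
for (MachineInstr &DefMI : MRI->def_instructions(Reg)) {
  // ... inspect DefMI as in the original loop body ...
}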

320–321 ↗(On Diff #34135)

This can be a range for

371 ↗(On Diff #34135)

isVirtualRegister should be checked first

400–401 ↗(On Diff #34135)

Range for

419–420 ↗(On Diff #34135)

Can you factor this into a separate check predicate function to assert on

423–424 ↗(On Diff #34135)

Range for

426–427 ↗(On Diff #34135)

No space between assert and (

435–436 ↗(On Diff #34135)

Range for

507–508 ↗(On Diff #34135)

Capitalize / period

526–527 ↗(On Diff #34135)

Range for

539–544 ↗(On Diff #34135)

This could be an std::find or factored into own function
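
For instance (Succs and Succ are placeholder names for the container and element involved):

#include <algorithm>

// Only add the successor if it is not already recorded.
if (std::find(Succs.begin(), Succs.end(), Succ) == Succs.end())
  Succs.push_back(Succ);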

546–552 ↗(On Diff #34135)

Asserts should not be placed under DEBUG(). It would be better to put this into a separate function and assert on it
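
A sketch of the refactoring, with hypothetical names; the point is that the check becomes a predicate the assert can call, so it runs in any asserts-enabled build, not only under -debug:

#include "llvm/CodeGen/ScheduleDAG.h"
#include <algorithm>
#include <cassert>
#include <vector>
using namespace llvm;

// Predicate for the assert (names are placeholders).
static bool blockContains(const std::vector<SUnit *> &BlockSUs,
                          const SUnit *SU) {
  return std::find(BlockSUs.begin(), BlockSUs.end(), SU) != BlockSUs.end();
}

// At the original site:
//   assert(blockContains(BlockSUs, SU) && "SU missing from its block");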

558–563 ↗(On Diff #34135)

Similar std::find or separate function again

591–592 ↗(On Diff #34135)

Range for

597–598 ↗(On Diff #34135)

Single quotes around the single characters

650–652 ↗(On Diff #34135)

No return after else

710 ↗(On Diff #34135)

Space around -

850 ↗(On Diff #34135)

.empty()

870 ↗(On Diff #34135)

Having a set as the key to a map looks very strange. Is this really what you want?

911–912 ↗(On Diff #34135)

Should use C++ style comments

1040–1041 ↗(On Diff #34135)

This can be in a single DEBUG() statement

1081 ↗(On Diff #34135)

Non English comment

1123–1134 ↗(On Diff #34135)

This should be another separate function

1151–1152 ↗(On Diff #34135)

Range for

1242–1247 ↗(On Diff #34135)

No asserts under DEBUG()

1355–1358 ↗(On Diff #34135)

Single quote ' ' and '\n'

lib/Target/AMDGPU/SIMachineScheduler.h
26–27 ↗(On Diff #34135)

This is wrapped weirdly. I would prefer one item per line

34 ↗(On Diff #34135)

This looks like it should be an llvm::BitVector

213–214 ↗(On Diff #34135)

Can you put a line between each function and the comment on the following declaration

365–366 ↗(On Diff #34135)

These should return references

axeldavy added inline comments.Sep 30 2015, 1:07 AM
lib/Target/AMDGPU/SIInstrInfo.cpp
2732–2734 ↗(On Diff #34135)

The instruction latency is not filled in for all instructions yet, so we cannot rely on it.

I don't think we need to say barriers are high latency, because (unless I'm wrong)
they will separate the scheduling regions (the instruction is not moved).

lib/Target/AMDGPU/SIMachineScheduler.cpp
128 ↗(On Diff #34135)

Looks like we don't need the extra spacing for now. I'll remove it in the next patch.

174 ↗(On Diff #34135)

I just used it to signal that the SIScheduleBlock functions are below. I'm OK with removing it if you don't like it.

291 ↗(On Diff #34135)

The function is an adaptation of the findUseBetween function in RegisterPressure.cpp, which does the same thing but with uses.
For Uses it makes sense to use use_instr_nodbg_begin, because there doesn't seem to be another way with the LiveRange.

For Defs it looks like it can be done with the LiveRange.

Do you really want me to try to use LiveRange instead?
This version with def_instr_begin works and is fast enough.

295–297 ↗(On Diff #34135)

I didn't know about range-fors before you advised me to use them.

Is it possible to use them here, given that there is a cast to MachineInstr*?

517 ↗(On Diff #34135)

There was a mistake here: it shouldn't be SU->NodeNum but the successor's NodeNum.

My next patch will fix it.

axeldavy added inline comments.Sep 30 2015, 2:51 AM
lib/Target/AMDGPU/SIMachineScheduler.cpp
217 ↗(On Diff #34135)

I think this hardcoded value is suboptimal (as any hardcoded value would be) and that
the length of the block and other factors should be taken into account for that heuristic.

I don't think it makes sense to ask the user to set this value.

I suggest keeping the hardcoded value for now, and replacing the heuristic when we find a better one.

axeldavy added inline comments.Sep 30 2015, 9:22 AM
lib/Target/AMDGPU/SIMachineScheduler.cpp
650–652 ↗(On Diff #34135)

Do you mean I should just remove the else?

870 ↗(On Diff #34135)

Yes, I want to assign colors to different sets. This looked like the best way to do it.

1556 ↗(On Diff #34135)

I use SITTI because TTI is already taken.

SITTI is TTI cast to SIInstrInfo*.
I use SITTI several times; it avoids casting again.

axeldavy added inline comments.Sep 30 2015, 9:47 AM
lib/Target/AMDGPU/SIRegisterInfo.cpp
33–34 ↗(On Diff #34135)

Unfortunately I didn't find any simpler way than comparing the names.
There is probably an alternative, which would be to have LLVM generate variables with the pressure set names containing their IDs, as is done for pressure classes, but I haven't investigated that.

Something like 'if (StringRef("SGPR_32") == getRegPressureSetName(i))' ?

axeldavy updated this revision to Diff 43211.Dec 18 2015, 3:01 AM

Updated diff. A lot of cleanups.

axeldavy marked 39 inline comments as done.Dec 18 2015, 3:15 AM
axeldavy updated this revision to Diff 43214.Dec 18 2015, 3:17 AM

Change uses of " " to ' '

axeldavy marked 2 inline comments as done.Dec 18 2015, 3:17 AM

I've tried to understand what this scheduler does. As a general comment, I think it would be useful for the review if you decided upon a single variant for each of the steps of the scheduler. Having different variants makes things harder to follow.

I do like the idea of explicitly keeping track of VMEM instructions, since they have both the highest latency and the best latency-mitigation tool. I have focused on the blocks so far, and I have to say I'm not convinced that they are the right way to go. Some high-level thoughts on the topic:

  1. As a general approach that would fit better into the MachineSchedStrategy framework as I understand it, how about a bottom-up scheduling in which, after you schedule the instruction consuming the result of a VMEM load, you keep track of how many other instructions (or better: cycles) you have already scheduled to cover the latency, and use that in decision making (see the sketch after this list).

1b) There will be dependency chains between VMEMs (dependent texture loads in graphics, more general patterns in compute). For such dependency chains, it seems like a good idea to do an analysis prepass (as part of the scheduler) which collects information about stuff like how many DAG nodes are strictly between two dependent VMEM loads. This kind of information can be useful: consider a dependent texture load with a single dependency step in a fragment shader that looks like this: (interp) -> image -> (long calculation) -> image -> (short calculation) -> exp. The fact that one of these calculations is short and the other long can be obtained by a DAG traversal similar to your coloring, and it can be used to decide that the second image instruction should not be scheduled too early (in program order, or too late in bottom-up scheduling order).

  2. I've noticed while playing around that the blocks are of very different sizes. This alone made me a bit suspicious.
  3. The blocks are a form of very early decision making, and as a general rule, very early decision making tends to lead to bad results in optimization, unless it is done very well.
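
To illustrate point 1, here is a purely conceptual sketch; the names and the latency figure are made up, and this is not the actual MachineSchedStrategy interface. The idea is to track, for each VMEM load whose consumer has already been scheduled bottom-up, how many cycles of its latency have been covered so far, and use that to bias candidate selection:

#include <vector>

// Conceptual bookkeeping for latency coverage during bottom-up scheduling.
struct PendingVMEM {
  unsigned LoadNodeNum;   // SUnit::NodeNum of the VMEM load
  unsigned CyclesCovered; // cycles scheduled below its consumer so far
};

class LatencyCoverageTracker {
  std::vector<PendingVMEM> Pending;

public:
  // Called when the consumer of a VMEM result is scheduled (bottom-up).
  void addPendingLoad(unsigned LoadNodeNum) {
    Pending.push_back({LoadNodeNum, 0});
  }

  // Called after every scheduled instruction with its estimated cycles.
  void noteScheduled(unsigned Cycles) {
    for (PendingVMEM &P : Pending)
      P.CyclesCovered += Cycles;
  }

  // The load should be pushed further up (scheduled later in bottom-up
  // order) while its latency is not yet covered.
  bool shouldDelay(unsigned LoadNodeNum) const {
    const unsigned AssumedVMEMLatency = 400; // placeholder figure
    for (const PendingVMEM &P : Pending)
      if (P.LoadNodeNum == LoadNodeNum)
        return P.CyclesCovered < AssumedVMEMLatency;
    return false;
  }
};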

None of this is necessarily the final nail in the coffin for the blocks idea. I'd just like to see a better justification why scheduling at an instruction level is really insufficient. (I do think that some bigger-picture early analysis of the DAG to take latency into account is likely to be very useful.)

lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
25 ↗(On Diff #43214)

To keep recompile times lower, it would be good to get rid of this #include; AFAICT the only dependency is createSIMachineScheduler, which can be moved to SIMachineScheduler.cpp (and declared in AMDGPU.h).
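
A sketch of the suggested split, keeping the createSIMachineScheduler name used by the patch (the definition itself would move next to the scheduler classes in SIMachineScheduler.cpp):

// AMDGPU.h: forward declarations are enough, so AMDGPUTargetMachine.cpp no
// longer needs to include SIMachineScheduler.h.
namespace llvm {
class MachineSchedContext;
class ScheduleDAGInstrs;

ScheduleDAGInstrs *createSIMachineScheduler(MachineSchedContext *C);
} // end namespace llvm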

lib/Target/AMDGPU/SIMachineScheduler.cpp
38 ↗(On Diff #43214)

What precisely do you mean by this last point?

42–45 ↗(On Diff #43214)

This is hard to understand. Do you have some example of what you mean exactly, and an explanation of why it happens in the generic scheduler?

128 ↗(On Diff #43214)

This function should be static.

473 ↗(On Diff #43214)

Since releaseSucc() is only used here, the logic of whether to actually release or not should be moved out of it and into this loop. It is confusing to have a method releaseXxx which does not actually always release Xxx.

The same actually holds for the releaseSuccessors method as a whole. This would be helped a lot if the parameter were an enum with descriptive names, such that the meaning of the parameter is clear at call sites.
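
Something along these lines (the enumerator names are hypothetical; they just need to describe the two behaviours the boolean currently selects):

// An enum makes the call sites self-describing, e.g.
//   releaseSuccessors(SU, SIReleaseKind::OutOfBlockOnly);
enum class SIReleaseKind { InOrOutBlock, OutOfBlockOnly };

void releaseSuccessors(SUnit *SU, SIReleaseKind Kind);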

717–753 ↗(On Diff #43214)

This algorithm will give different colors to SUnits with different dependencies, but it does not always give the same color to SUnits with the same dependencies. Consider:

High latency instructions A, B, C, followed by:
X1 = op A, B (has predecessors A, B, gets new color N)
X2 = op X1, C (has predecessor colors N, C, gets new color N + 1)
Y1 = op A, C (has predecessors A, C, gets new color N + 2)
Y2 = op Y1, B (has predecessor colors N + 2, B, gets new color N + 3)
X2 and Y2 have the same set of high latency predecessors, but are colored differently.

Not sure whether this is critical, but it's something to keep in mind.

757 ↗(On Diff #43214)

Grammar: same *as* before

800 ↗(On Diff #43214)

Use std::pair instead of std::set here.

827 ↗(On Diff #43214)

What is this method supposed to do? Which nodes are affected by it?

942 ↗(On Diff #43214)

This method and the two above it are extremely similar. It should be possible to merge them somehow.

lib/Target/AMDGPU/SIMachineScheduler.h
190 ↗(On Diff #43214)

Style: first letter of method name should be lower case

207 ↗(On Diff #43214)

Spelling: Grouped

226–230 ↗(On Diff #43214)

I do not understand the semantics of reserved vs. non-reserved. Why the distinction? What does it mean?

lib/Target/AMDGPU/SIRegisterInfo.cpp
669–684 ↗(On Diff #43214)

I believe this method is unused.

lib/Target/AMDGPU/SIRegisterInfo.h
28–30 ↗(On Diff #43214)

Perhaps a constant or enum could be used here instead of the magic 10.

I agree with you that working with blocks is not optimal.
However, it is one way to make the problem simpler for the scheduler and achieve better results.

Ideally some analysis could be used, as you say, to make good decisions for the scheduling.
However, I couldn't find any such good analysis pass yet, nor a way it could be used.

Grouping the instructions together is not enforced by this scheduler. In fact, one can have a variant that does one instruction == one block;
however, so far it gives worse results, a sign that the grouping helps the second part of the scheduler make better decisions.

The thing that gave me the idea of blocks is that when you look at the original TGSI, you can easily make sub-blocks of instructions in your mind, place the high latency instructions further away from their users, and come up with a schedule. I hoped that the pass would rebuild such blocks, which are easy to manipulate.

One hard thing during scheduling is when a lot of different instructions that all increase register usage are ready to be scheduled.
Which instructions should you schedule now, in order to contain register usage and decrease it again soon?
One solution would be to look at a lot of possible schedule orders for the next k iterations and take the best one, but that's expensive.
The problem is very likely NP-hard.

Working with blocks is a way to make this simpler for the scheduler: a relatively good decision has already been made about what the next k instructions after each one are.

lib/Target/AMDGPU/SIMachineScheduler.cpp
38 ↗(On Diff #43214)

This was an attempt to explain that fewer registers used => latencies are better hidden (better wave counts).

42–45 ↗(On Diff #43214)

Imagine you have 100 instructions of the form v_mov REG, CONSTANT, and one VMEM instruction loading two registers.
If you run out of registers and the default scheduler decides to try to reduce register usage,
it will schedule all the v_mov instructions before the VMEM instruction, because each of them increases register usage less.

If everything else depends on that VMEM, this results in a huge VGPR usage increase.

473 ↗(On Diff #43214)

I don't get the second part of the comment. Should I replace the boolean with an enum?

827 ↗(On Diff #43214)

This is for nodes which have both
CurrentBottomUpReservedDependencyColoring[SU->NodeNum] == 0 and
CurrentTopDownReservedDependencyColoring[SU->NodeNum] == 0.

For example, if we have (A being a high latency instruction):
A = op X2
X3 = op A, X1

X1 doesn't have any high latency instruction dependency in either direction.

Without this function, all instructions in X1's situation are put in the same block.
This pass puts them into different blocks, according to their sets of dependencies on the already assigned colors.

lib/Target/AMDGPU/SIMachineScheduler.h
226–230 ↗(On Diff #43214)

Reserved: You can only add elements
Non-Reserved: you can change the color of all the elements if you wish.

If you have better names, don't hesitate...

axeldavy updated this revision to Diff 44452.Jan 11 2016, 2:19 AM
axeldavy marked 3 inline comments as done.

Here is a new version of the patch with the last comments taken into account.

axeldavy updated this revision to Diff 44559.Jan 11 2016, 2:27 PM

Add a test with --misched=si

axeldavy updated this revision to Diff 44691.Jan 12 2016, 3:56 PM

Add some getters to protect some variables.

nhaehnle accepted this revision.Jan 12 2016, 6:42 PM
nhaehnle added a reviewer: nhaehnle.

Thanks for taking care of those changes!

I'm still only lukewarm on the scheduler because the blocks business seems a bit magical, and it would be nice to have a proper explanation of why this is better and (more importantly) why the same cannot be achieved by improving on the existing scheduler with less of a "change the whole world" approach.

That said, the evidence is clearly that there are graphics performance improvements to be had with this scheduler, which I don't think we should keep from people. Furthermore, the change is low risk because it's nice and modular.

So all in all, I believe this should be accepted, and I will commit it soon.

This revision is now accepted and ready to land.Jan 12 2016, 6:42 PM
This revision was automatically updated to reflect the committed changes.