
AMDGPU/SI: Add SI Machine Scheduler
ClosedPublic

Authored by axeldavy on Aug 9 2015, 6:01 AM.

Details

Diff Detail

Repository
rL LLVM

Event Timeline

arsenm added inline comments.Sep 14 2015, 2:37 PM
lib/Target/AMDGPU/SIInstrInfo.cpp
2732–2734 ↗(On Diff #34135)

This can return the condition without the if.

This also isn't 100% accurate. This will miss barriers and a few other cases. Better would be to check the instruction latency maybe with a threshold value of some sort.

lib/Target/AMDGPU/SIMachineScheduler.cpp
128 ↗(On Diff #34135)

Why are these padded with spaces at the end?

134 ↗(On Diff #34135)

You don't need a ; here

174 ↗(On Diff #34135)

Looks like a stray comment

200–201 ↗(On Diff #34135)

This sentence reads weirdly / seems to be missing some words

How about:

"Low latency instructions which do not depend on other low latency instructions we haven't waited for"

202–203 ↗(On Diff #34135)

Ditto

209 ↗(On Diff #34135)

"If never in the block there is a lot of constant loads" -> "If there are not many constant loads in the block"

217 ↗(On Diff #34135)

This shouldn't be hardcoded. Use a pass parameter or a command-line option read during pass initialization.

245–246 ↗(On Diff #34135)

Did you mean for these to be copies?

247 ↗(On Diff #34135)

Capitalize and period

249 ↗(On Diff #34135)

Can you use this instead of repeating (*I)-> multiple times?

259 ↗(On Diff #34135)

Dead code. Uncomment and maybe add more of a description if this is useful debug printing

273–274 ↗(On Diff #34135)

This can be a range for
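The "range for" suggestions throughout this review refer to C++11 range-based for loops replacing explicit iterator loops. A minimal sketch of the pattern (the `SUnit` stand-in and loop body are hypothetical, not the actual patch code):

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a scheduling unit.
struct SUnit { int NodeNum; };

// Iterator style, as flagged in the review:
//   for (std::vector<SUnit*>::iterator I = SUnits.begin(),
//        E = SUnits.end(); I != E; ++I)
//     Total += (*I)->NodeNum;
// Range-for style: no iterator boilerplate, no repeated (*I)->
int sumNodeNums(const std::vector<SUnit *> &SUnits) {
  int Total = 0;
  for (SUnit *SU : SUnits)
    Total += SU->NodeNum;
  return Total;
}

// Small demonstration building a container and summing it.
int demoSum() {
  SUnit A{1}, B{2}, C{3};
  std::vector<SUnit *> SUnits{&A, &B, &C};
  return sumNodeNums(SUnits);
}
```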

291 ↗(On Diff #34135)

Function would be better named isDefBetween since it doesn't return the def.

Something about this function doesn't seem right to me. I don't think you should be looking at the def_instructions, and instead checking the Segments of the LiveRange.

295–297 ↗(On Diff #34135)

This can be a range loop over MRI->def_instructions()

320–321 ↗(On Diff #34135)

This can be a range for

371 ↗(On Diff #34135)

isVirtualRegister should be checked first

400–401 ↗(On Diff #34135)

Range for

419–420 ↗(On Diff #34135)

Can you factor this into a separate check predicate function to assert on?

423–424 ↗(On Diff #34135)

Range for

426–427 ↗(On Diff #34135)

No space between assert and (

435–436 ↗(On Diff #34135)

Range for

507–508 ↗(On Diff #34135)

Capitalize / period

526–527 ↗(On Diff #34135)

Range for

539–544 ↗(On Diff #34135)

This could be an std::find or factored into own function
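The manual-search loops flagged here can usually be replaced by `std::find` or factored into a small named helper. A hedged sketch of the pattern (container and names are placeholders, not the patch's actual code):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Instead of an explicit scan like:
//   bool Found = false;
//   for (unsigned i = 0; i < Succs.size(); ++i)
//     if (Succs[i] == Target) { Found = true; break; }
// ...use std::find, factored into a helper with a descriptive name:
bool containsSucc(const std::vector<int> &Succs, int Target) {
  return std::find(Succs.begin(), Succs.end(), Target) != Succs.end();
}
```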

546–552 ↗(On Diff #34135)

Asserts should not be placed under DEBUG(). It would be better to put this into a separate function and assert on it
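To illustrate the suggestion: code inside LLVM's `DEBUG()` macro only executes when debug output is enabled, so an `assert` buried in it can be silently skipped. The fix is to move the invariant into a predicate function and assert on it directly. A generic sketch with a hypothetical invariant:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical invariant: no two scheduling units share a NodeNum.
bool nodeNumsAreUnique(const std::vector<int> &NodeNums) {
  for (size_t I = 0; I < NodeNums.size(); ++I)
    for (size_t J = I + 1; J < NodeNums.size(); ++J)
      if (NodeNums[I] == NodeNums[J])
        return false;
  return true;
}

void checkInvariant(const std::vector<int> &NodeNums) {
  // The predicate always runs in asserts builds (independent of any
  // debug-output flag), and the call compiles away under NDEBUG.
  assert(nodeNumsAreUnique(NodeNums) && "duplicate node numbers");
}
```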

558–563 ↗(On Diff #34135)

Similar std::find or separate function again

591–592 ↗(On Diff #34135)

Range for

597–598 ↗(On Diff #34135)

Single quotes around the single characters

650–652 ↗(On Diff #34135)

No return after else

710 ↗(On Diff #34135)

Space around -

850 ↗(On Diff #34135)

.empty()

870 ↗(On Diff #34135)

Having a set as the key to a map looks very strange. Is this really what you want?

911–912 ↗(On Diff #34135)

Should use C++ style comments

1040–1041 ↗(On Diff #34135)

This can be in a single DEBUG() statement

1081 ↗(On Diff #34135)

Non English comment

1123–1134 ↗(On Diff #34135)

This should be another separate function

1151–1152 ↗(On Diff #34135)

Range for

1242–1247 ↗(On Diff #34135)

No asserts under DEBUG()

1355–1358 ↗(On Diff #34135)

Single quote ' ' and '\n'

lib/Target/AMDGPU/SIMachineScheduler.h
26–27 ↗(On Diff #34135)

This is wrapped weirdly. I would prefer one item per line

34 ↗(On Diff #34135)

This looks like it should be an llvm::BitVector

213–214 ↗(On Diff #34135)

Can you put a line between each function and the comment on the following declaration

365–366 ↗(On Diff #34135)

These should return references

axeldavy added inline comments.Sep 30 2015, 1:07 AM
lib/Target/AMDGPU/SIInstrInfo.cpp
2732–2734 ↗(On Diff #34135)

The instruction latency for all instructions is not filled in yet, so we cannot rely on it.

I don't think we need to treat barriers as high-latency, because (unless I'm wrong)
they will separate the scheduling regions (the instruction is not moved across them).

lib/Target/AMDGPU/SIMachineScheduler.cpp
128 ↗(On Diff #34135)

It looks like we don't need the extra spacing for now. I'll remove it in the next patch.

174 ↗(On Diff #34135)

I just used it to signal that the SIScheduleBlock functions are below. I'm OK with removing it if you don't like it.

291 ↗(On Diff #34135)

The function is an adaptation of RegisterPressure.cpp's findUseBetween function, which does the same thing but with uses.
For uses, use_instr_nodbg_begin makes sense because there doesn't seem to be another way with the LiveRange.

For defs, it looks like it can be done with the LiveRange.

Do you really want me to try using the LiveRange instead?
This version with def_instr_begin works and is fast enough.

295–297 ↗(On Diff #34135)

I didn't know about range-based for loops before you advised me to use them.

Is it possible to use one here, given that there is a cast to MachineInstr*?

517 ↗(On Diff #34135)

There was a mistake here: it shouldn't be SU->NodeNum but the successor's NodeNum.

My next patch will fix it.

axeldavy added inline comments.Sep 30 2015, 2:51 AM
lib/Target/AMDGPU/SIMachineScheduler.cpp
217 ↗(On Diff #34135)

I think this hardcoded value is suboptimal (as any hardcoded value would be), and that
the length of the block and other factors should be taken into account in that heuristic.

I don't think it makes sense to ask the user to set this value.

I suggest keeping the hardcoded value for now, and replacing the heuristic when we find a better one.

axeldavy added inline comments.Sep 30 2015, 9:22 AM
lib/Target/AMDGPU/SIMachineScheduler.cpp
650–652 ↗(On Diff #34135)

Do you mean I should just remove the else?

870 ↗(On Diff #34135)

Yes, I want to assign colors to different sets. This looked like the best way to do it.

1556 ↗(On Diff #34135)

I use SITTI because TTI is already taken.

SITTI is TTI cast to SIInstrInfo*.
I use it several times; keeping it around avoids casting again.

axeldavy added inline comments.Sep 30 2015, 9:47 AM
lib/Target/AMDGPU/SIRegisterInfo.cpp
33–34 ↗(On Diff #34135)

Unfortunately, I didn't find any simpler way than comparing the names.
There is probably an alternative, which would be to have LLVM generate variables with the pressure set names containing their IDs, as is done for pressure classes, but I haven't investigated that.

Something like 'if (StringRef("SGPR_32") == getRegPressureSetName(i))'?

axeldavy updated this revision to Diff 43211.Dec 18 2015, 3:01 AM

Updated the diff. A lot of cleanups.

axeldavy marked 39 inline comments as done.Dec 18 2015, 3:15 AM
axeldavy updated this revision to Diff 43214.Dec 18 2015, 3:17 AM

Change uses of " " to ' '

axeldavy marked 2 inline comments as done.Dec 18 2015, 3:17 AM

I've tried to understand what this scheduler does. As a general comment, I think it would be useful for the review if you decided upon a single variant for each of the steps of the scheduler. Having different variants makes things harder to follow.

I do like the idea of explicitly keeping track of VMEM instructions, since they have both the highest latency and the best latency-mitigation tool. I have focused on the blocks so far, and I have to say I'm not convinced that they are the right way to go. Some high-level thoughts on the topic:

  1. As a general approach that would fit better into the MachineSchedStrategy framework as I understand it, how about a bottom-up scheduling in which after you schedule the instruction consuming the result of a VMEM load, you keep track of how many other instructions (or better: cycles) you have already scheduled to cover the latency, and use that in decision making.

1b) There will be dependency chains between VMEMs (dependent texture loads in graphics, more general patterns in compute). For such dependency chains, it seems like a good idea to do an analysis prepass (as part of the scheduler) which collects information about stuff like how many DAG nodes are strictly between two dependent VMEM loads. This kind of information can be useful: consider a dependent texture load with a single dependency step in a fragment shader that looks like this: (interp) -> image -> (long calculation) -> image -> (short calculation) -> exp. The fact that one of these calculations is short and the other long can be obtained by a DAG traversal similar to your coloring, and it can be used to decide that the second image instruction should not be scheduled too early (in program order, or too late in bottom-up scheduling order).

  2. I've noticed while playing around that the blocks are of very different sizes. This alone made me a bit suspicious.
  3. The blocks are a form of very early decision making, and as a general rule, very early decision making tends to lead to bad results in optimization, unless it is done very well.

None of this is necessarily the final nail in the coffin for the blocks idea. I'd just like to see a better justification why scheduling at an instruction level is really insufficient. (I do think that some bigger-picture early analysis of the DAG to take latency into account is likely to be very useful.)
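The prepass proposed in 1b) — counting how many DAG nodes sit strictly between two dependent VMEM loads — can be sketched as two reachability walks: everything reachable from the first load, intersected with everything that can reach the second. A minimal self-contained sketch (the adjacency-list representation is hypothetical; the real scheduler would walk SUnit dependency edges):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mark every node reachable from Start via the given adjacency lists.
void reach(const std::vector<std::vector<int>> &Adj, int Start,
           std::vector<bool> &Seen) {
  Seen[Start] = true;
  for (int Next : Adj[Start])
    if (!Seen[Next])
      reach(Adj, Next, Seen);
}

// Count nodes strictly between LoadA and LoadB: reachable from LoadA
// (via successor edges) and able to reach LoadB (via predecessor edges).
int countNodesBetween(const std::vector<std::vector<int>> &Succ,
                      const std::vector<std::vector<int>> &Pred,
                      int LoadA, int LoadB) {
  std::vector<bool> FromA(Succ.size(), false), ToB(Pred.size(), false);
  reach(Succ, LoadA, FromA); // everything LoadA can reach
  reach(Pred, LoadB, ToB);   // everything that can reach LoadB
  int Count = 0;
  for (size_t N = 0; N < Succ.size(); ++N)
    if (FromA[N] && ToB[N] && (int)N != LoadA && (int)N != LoadB)
      ++Count;
  return Count;
}

// Example DAG: chain 0 -> 1 -> 2 -> 3 plus a direct edge 0 -> 3.
// Nodes 1 and 2 lie strictly between loads 0 and 3.
int demoCount() {
  std::vector<std::vector<int>> Succ{{1, 3}, {2}, {3}, {}};
  std::vector<std::vector<int>> Pred{{}, {0}, {1}, {0, 2}};
  return countNodesBetween(Succ, Pred, 0, 3);
}
```

Comparing that count against the latency of the first load would give the "short vs. long calculation" signal described above.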

lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
25 ↗(On Diff #43214)

To keep recompile times lower, it would be good to get rid of this #include; AFAICT the only dependency is createSIMachineScheduler, which can be moved to SIMachineScheduler.cpp (and declared in AMDGPU.h).

lib/Target/AMDGPU/SIMachineScheduler.cpp
38 ↗(On Diff #43214)

What precisely do you mean by this last point?

42–45 ↗(On Diff #43214)

This is hard to understand. Do you have some example of what you mean exactly, and an explanation of why it happens in the generic scheduler?

128 ↗(On Diff #43214)

This function should be static.

473 ↗(On Diff #43214)

Since releaseSucc() is only used here, the logic of whether to actually release or not should be moved out of it and into this loop. It is confusing to have a method releaseXxx which does not actually always release Xxx.

The same actually holds for the releaseSuccessors method as a whole. This would be helped a lot if the parameter were an enum with descriptive names, such that the meaning of the parameter is clear at call sites.

717–753 ↗(On Diff #43214)

This algorithm will give different colors to SUnits with different dependencies, but it does not always give the same color to SUnits with the same dependencies. Consider:

High latency instructions A, B, C, followed by:
X1 = op A, B (has predecessors A, B, gets new color N)
X2 = op X1, C (has predecessor colors N, C, gets new color N + 1)
Y1 = op A, C (has predecessors A, C, gets new color N + 2)
Y2 = op Y1, B (has predecessor colors N + 2, B, gets new color N + 3)
X2 and Y2 have the same set of high latency predecessors, but are colored differently.

Not sure whether this is critical, but it's something to keep in mind.
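The counterexample can be reproduced with a toy model of the coloring: a node gets a fresh color whenever its set of *immediate* predecessor colors is new, so two nodes with identical transitive high-latency predecessors can still be colored differently. This is a simplified stand-in for the patch's algorithm, with names matching the example above:

```cpp
#include <cassert>
#include <map>
#include <set>

// Toy colorer: reuse a color only when the exact set of immediate
// predecessor colors has been seen before; otherwise mint a new one.
struct Colorer {
  std::map<std::set<int>, int> SeenKeys;
  int NextColor = 0;
  int freshColor() { return NextColor++; }
  int colorFor(const std::set<int> &PredColors) {
    auto It = SeenKeys.find(PredColors);
    if (It != SeenKeys.end())
      return It->second; // same immediate pred colors: reuse
    int C = freshColor(); // new combination: new color
    SeenKeys[PredColors] = C;
    return C;
  }
};

// Returns true if X2 and Y2 receive different colors even though both
// transitively depend on exactly the high-latency set {A, B, C}.
bool demoDifferentColors() {
  Colorer CL;
  int A = CL.freshColor(), B = CL.freshColor(), C = CL.freshColor();
  int X1 = CL.colorFor({A, B});  // new color N
  int X2 = CL.colorFor({X1, C}); // new color N + 1
  int Y1 = CL.colorFor({A, C});  // new color N + 2
  int Y2 = CL.colorFor({Y1, B}); // new color N + 3
  (void)X1; (void)Y1;
  return X2 != Y2;
}
```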

757 ↗(On Diff #43214)

Grammar: same *as* before

800 ↗(On Diff #43214)

Use std::pair instead of std::set here.

827 ↗(On Diff #43214)

What is this method supposed to do? Which nodes are affected by it?

942 ↗(On Diff #43214)

This method and the two above it are extremely similar. It should be possible to merge them somehow.

lib/Target/AMDGPU/SIMachineScheduler.h
190 ↗(On Diff #43214)

Style: first letter of method name should be lower case

207 ↗(On Diff #43214)

Spelling: Grouped

226–230 ↗(On Diff #43214)

I do not understand the semantics of reserved vs. non-reserved. Why the distinction? What does it mean?

lib/Target/AMDGPU/SIRegisterInfo.cpp
669–684 ↗(On Diff #43214)

I believe this method is unused.

lib/Target/AMDGPU/SIRegisterInfo.h
28–30 ↗(On Diff #43214)

Perhaps a constant or enum could be used here instead of the magic 10.

I agree with you that working with blocks is not optimal.
However, it is one way to make the problem simpler for the scheduler and achieve better results.

Ideally some analysis could be used, as you say, to make good decisions for the scheduling.
However, I haven't found a good analysis pass for this yet, nor a way it could be used.

Grouping the instructions together is not enforced by this scheduler. In fact, one can have a variant that does one instruction == one block;
however, so far it gives worse results, a sign that the grouping helps the second part of the scheduler make better decisions.

What gave me the idea of blocks is that when you look at the original TGSI, you can easily form sub-blocks of instructions in your mind, place the high-latency instructions further away from their users, and come up with a schedule. I hoped this pass would rebuild such blocks, which are easy to manipulate.

One hard situation during scheduling is when a lot of different instructions that all increase register usage are ready at once.
Which instructions should you schedule now, in order to contain register usage and decrease it again soon?
One solution would be to look at many possible schedule orders for the next k iterations and take the best one, but that's expensive.
The problem is very likely NP-hard.

Working with blocks is a way to make this simpler for the scheduler: a relatively good decision has already been made about what the next k instructions after each one are.

lib/Target/AMDGPU/SIMachineScheduler.cpp
38 ↗(On Diff #43214)

This was an attempt to explain that fewer registers used => latencies are better hidden (better wave counts).

42–45 ↗(On Diff #43214)

Imagine you have 100 instructions of the form v_mov REG, CONSTANT, and one VMEM instruction loading two registers.
If you run out of registers and the default scheduler decides to try to reduce register usage,
it will schedule all the v_mov instructions before the VMEM instruction, because they don't increase register usage as much.

If everything else depends on that VMEM, this results in a huge VGPR usage increase.

473 ↗(On Diff #43214)

I don't get the second part of the comment. Should I replace the boolean with an enum?

827 ↗(On Diff #43214)

This is for nodes which have both
CurrentBottomUpReservedDependencyColoring[SU->NodeNum] == 0 and
CurrentTopDownReservedDependencyColoring[SU->NodeNum] == 0.

For example, if we have (with A a high-latency instruction):
A = op X2
X3 = op A, X1

X1 doesn't have any high-latency instruction dependency in either direction.

Without this function, all instructions in X1's situation are put in the same block.
This pass puts them in different blocks, according to the set of dependencies in the already-assigned colors.

lib/Target/AMDGPU/SIMachineScheduler.h
226–230 ↗(On Diff #43214)

Reserved: you can only add elements.
Non-reserved: you can change the color of all the elements if you wish.

If you have better names, don't hesitate...

axeldavy updated this revision to Diff 44452.Jan 11 2016, 2:19 AM
axeldavy marked 3 inline comments as done.

Here is a new version of the patch with the last comments taken into account.

axeldavy updated this revision to Diff 44559.Jan 11 2016, 2:27 PM

Add a test with --misched=si

axeldavy updated this revision to Diff 44691.Jan 12 2016, 3:56 PM

Add some getters to protect some variables.

nhaehnle accepted this revision.Jan 12 2016, 6:42 PM
nhaehnle added a reviewer: nhaehnle.

Thanks for taking care of those changes!

I'm still only lukewarm on the scheduler, because the blocks business seems a bit magical, and it would be nice to have a proper explanation of why this is better and (more importantly) why the same cannot be achieved by improving the existing scheduler with less of a "change the whole world" approach.

That said, the evidence is clearly that there are graphics performance improvements to be had with this scheduler, which I don't think we should keep from people. Furthermore, the change is low risk because it's nice and modular.

So all in all, I believe this should be accepted, and I will commit it soon.

This revision is now accepted and ready to land.Jan 12 2016, 6:42 PM
This revision was automatically updated to reflect the committed changes.