Page MenuHomePhabricator

[Refactor] Add the SchedHeuristic for Scheduler to allow platform customizing the heuristics
Needs ReviewPublic

Authored by steven.zhang on Aug 26 2020, 11:53 PM.

Details

Reviewers
fhahn
evandro
arsenm
rampitec
foad
kbarton
jsji
kerbowa
dmgreen
Group Reviewers
Restricted Project
Summary

Hi, all, I want to add a new SchedHeuristic to track all the heuristic Scheduler used basing on the follow facts.

  1. It is hard for target to add new heuristic now, as the Reason and tryCandidate() are maintained by base class.
  2. It is hard for target to change the priority of each heuristic(i.e. prefer cluster over stall) as it is hard code by tryCandidate.
  3. We assume the value of the Reason matches the priority in the tryCandidate which is easy to be out of sync. And it is out of sync now in fact. (see Stall and TopDepthReduce, we don't know their priority as it is changed according to input)
  4. The duplicated heuristic between pre-ra and post-ra scheduler. (There are many duplicate code between these two tryCandidate)

As PowerPC target want to add more and more heuristic, and we are struggling to deal with this, so, this is the motivation of this refactor and welcome for comments.

What I did now:

  • Each heuristic is an instance of the SchedHeuristic
  • Predefine all the heuristic we have now.
  • All the active heuristics are put into a vector and its position in the vector means its priority.
  • To add a new heuristic, just need to define a new instance of SchedHeuristic with Name and heuristic call back, then, call registerHeuristic() into any position of the vector.

Diff Detail

Event Timeline

steven.zhang created this revision.Aug 26 2020, 11:53 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 26 2020, 11:53 PM
steven.zhang requested review of this revision.Aug 26 2020, 11:53 PM
steven.zhang added a reviewer: Restricted Project.Aug 26 2020, 11:59 PM
foad added a comment.Aug 27 2020, 2:25 AM

It is hard for target to add new heuristic now, as the Reason and tryCandidate() are maintained by base class.
It is hard for target to change the priority of each heuristic(i.e. prefer cluster over stall) as it is hard code by tryCandidate.

I agree. For the AMDGPU scheduler I am very interested in latency and register pressure, and not very interested in the other heuristics. I look forward to be being able to tweak this in the target sched strategy.

qiucf added a subscriber: qiucf.Aug 27 2020, 3:29 AM
qiucf added inline comments.
llvm/include/llvm/CodeGen/MachineScheduler.h
957

So every time we'll need to iterate over this list? Maybe we can add a field like rank in SchedHeuristic and auto-fill them in registering?

llvm/lib/CodeGen/MachineScheduler.cpp
2935

Use #define STR(X) #X?

3047

Use StringSwitch for below?

Add helper macro to make the code more tidy and remove the tryGreater as it is not needed any more.

Address reviewer's comments.

It is hard for target to add new heuristic now, as the Reason and tryCandidate() are maintained by base class.
It is hard for target to change the priority of each heuristic(i.e. prefer cluster over stall) as it is hard code by tryCandidate.

I agree. For the AMDGPU scheduler I am very interested in latency and register pressure, and not very interested in the other heuristics. I look forward to be being able to tweak this in the target sched strategy.

Yes, PowerPC also has some specific requirement on the sched strategy.

Reimplement with class instead of object for heuristics.

Gentle ping ...

Hello. This looks interesting. I've run into places before where we just wanted to add an extra heuristics. I think some of our downstream schedules go a bit beyond this and would still need to override tryCandidate (the merge is going to be a pain!), but I've run into situations where it would definitely be useful.

This does add some non-inlineable virtual dispatch to a rather hot part of the compiler though. Do you have compile time numbers, to make sure this isn't going to cause any problems?

llvm/include/llvm/CodeGen/MachineScheduler.h
914

I would personally just keep everything as protected.

llvm/lib/Target/PowerPC/PPCMachineScheduler.h
39–40

None of these seems to do anything, beyond calling the base class. It's almost a shame to have to create a subclass of PostGenericScheduler just to add an extra heuristic. As opposed to being able to register them like mutations.

I imagine that a lot of the time these would end up doing something though and the scheduler would end up holding some state. That would be the case for our downstream schedulers I think. So I think this is fine as-is, but you might want to think a bit about how it would be possible to register without needing to subclass.

Hello. This looks interesting. I've run into places before where we just wanted to add an extra heuristics. I think some of our downstream schedules go a bit beyond this and would still need to override tryCandidate (the merge is going to be a pain!), but I've run into situations where it would definitely be useful.

Yes, our downstream scheduler also has the same pain. And what we can do now is to duplicate the tryCandidate and update it, which is hard to maintain. See D86681 .

This does add some non-inlineable virtual dispatch to a rather hot part of the compiler though. Do you have compile time numbers, to make sure this isn't going to cause any problems?

I will take this concern into consideration and evaluate the compiling time.

llvm/include/llvm/CodeGen/MachineScheduler.h
914

We have to declare the DAG as private as the subclass GenericScheduler/PostGenericScheduler both already keep it with different type. For the Heuristics, yes, we can make it as protected.

llvm/lib/Target/PowerPC/PPCMachineScheduler.h
39–40

Good suggestion. I think, we can keep the state inside the heuristic and add some necessary interface to initialize the state, update the state, and compare the candidate. Then, we can do it the same as mutation without subclass the scheduler. I will try with this direction.

steven.zhang updated this revision to Diff 290069.EditedSep 4 2020, 11:52 PM
steven.zhang retitled this revision from [NFC][Refactor] Add the SchedHeuristic for Scheduler to allow platform customizing the heuristics to [Refactor] Add the SchedHeuristic for Scheduler to allow platform customizing the heuristics.

Address reviewer's comments.

The Scheduler Strategy owns the SchedHeuristics, and the PreRA/PostRA Scheduler owns the Scheduler Strategy. So, we are now registering the heuristics inside Scheduler, so that, the target don't need to subclass the Scheduler Strategy.

However, as the scheduler heuristics might change from scheduler regions(we will override the region policy before scheduling the region), we have to subclass the Scheduler to register heuristics whenever we are entering a new sched region.

Regarding to the compiling time, yes, the virtual dispatch will increase compiling time (63s ->76s) from my testing. Fix this by using function object, and I didn't see compiling time deg now.

Regarding to the PowerPC tests change, it is a known issue of our scheduler heuristic, which is addressed by D86681. I abandon D86681 and merge it into this patch. If have concern on this, I can split it. The test change in this patch is completely the same as D86681.

Add two callbacks inside SchedHeuristic for target to maintain the internal state. @dmgreen Does it work for your downstream scheduler ?

  • initialize (It is called whenever we are entering a new sched region)
  • update (It is called whenever an instruction is scheduled.)

This is the typical case for target to add new heuristic:

  • subclass the SchedHeuristic and add private state as class member variable.
  • SchedHeuristic::initialize() Reset the state
  • SchedHeuristic::update() Update the state with the new DAG
  • SchedHeuristic::tryCandidate() Select the instruction according to current state.

Our downstream schedulers override pickNode/pickNext and have a certain amount of lookahead. We probably wouldn't be able to make use of this. I have in the past wanted to write a heuristic that this would fit this really nicely - it needed the last scheduled instruction and would bias based on whether they could "overlap". Unfortunately I never managed to get it to actually make performance reliably better.

I am a little surprised that, if virtual dispatch was hurting performance so much, function pointer would end up being a lot better. It's still a similar indirect non inlinable/optimisable function call.

llvm/include/llvm/CodeGen/MachineScheduler.h
1148

These are lowercase because they act like functions, not like variables?

llvm/lib/CodeGen/MachineScheduler.cpp
1540

You can probably just add an extra tracePick function, removing the need for this heuristic.

steven.zhang updated this revision to Diff 290170.EditedSep 6 2020, 8:42 PM

I am a little surprised that, if virtual dispatch was hurting performance so much, function pointer would end up being a lot better. It's still a similar indirect non inlinable/optimisable function call.

I think, the virtual dispatch has two memory dereference(one for vtable and another one for virtual func) while the function pointer only need one. And I am not sure if compiler can do some better optimization if there isn't any virtual function for a class. I double confirm and it indeed shows that, the new implementation didn't have any compiling time deg now.

Address other comments.

steven.zhang added inline comments.Sep 6 2020, 8:44 PM
llvm/include/llvm/CodeGen/MachineScheduler.h
1148

Yes. I add member function to avoid such confusion.

llvm/lib/CodeGen/MachineScheduler.cpp
1540

Good idea. Done.

This seems OK to me, if you can get some others to agree.

Thank you Dave. Can someone from other target have a look at this refactor to see if there is any problems ?

Gentle ping ...

foad added inline comments.Sep 24 2020, 7:20 AM
llvm/include/llvm/CodeGen/MachineScheduler.h
466

Typo disbaled.

Address comments.

steven.zhang marked an inline comment as done.Thu, Oct 1, 2:53 PM

Gentle ping...