This is an archive of the discontinued LLVM Phabricator instance.

[MachineScheduler] Try to issue the load instruction preferentially
AbandonedPublic

Authored by Allen on Jul 16 2022, 1:54 AM.

Details

Summary

Based on the discussion, it is preferable to hoist the loads as much as possible after register allocation.
https://discourse.llvm.org/t/insn-schedule-is-it-reasonable-to-issue-the-load-instruction-preferentially/63674

Diff Detail

Event Timeline

Allen created this revision.Jul 16 2022, 1:54 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 16 2022, 1:54 AM
Allen requested review of this revision.Jul 16 2022, 1:54 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 16 2022, 1:54 AM
fhahn added a comment.Jul 20 2022, 1:15 AM

Do you have any numbers on the benefit of that? I am not sure if using a target feature here is a good idea. If it is better to schedule aggressively for latency then this should probably be driven by the scheduling model.

Allen added a comment.Jul 20 2022, 1:33 AM

hi @fhahn , I'll gain benefits base on the tsv110 target (kunpeng 920), butI'm not sure whether this feature will have a negative performance impact on other architectures, so I add the target feature, does it reasonable ?

Sorry for the delay. My immediate thought was why not just increase the latency on the load. But I guess this is different. Most cpus can only issue a certain number of loads per cycle, so "load;load;load;load;load;add;add;add;add" will be worse than emitting "load;add;load;add;load;add;load;add". It will depend on the latencies of the loads though and any dependencies, most time we encode an optimistic latency into the schedule latencies for loads.

Most cpus will out-of-order execute around that anyway nowadays though, just filling up internal queues. It depends on how much they are fetching per cycle and what is acting as the bottleneck. So I wanted to run some benchmarks to see if this did modify anything, but they didn't show much change.

Umm. Is this for a downstream cpu or is it using tsv110? Does the scheduling model have the correct information for the loads in question? I worry that some of this code we are adding looks like dead code that someone could rightly delete as unused. Should they be added to the tuning features for the tsv110 cpu?

Allen planned changes to this revision.Jul 20 2022, 2:34 AM

Thank you for your suggestion, it is true that this unconditionally preferentially launch of the load instruction may be too aggressive.
Can you give me some guidance about the the function of "optimistic latency into the schedule latencies for loads code which function is", I can debug to see if a new solution can be used for improvement ?

The scheduling info will come from https://github.com/llvm/llvm-project/blob/439668871ac992159f00309d3bd837db287bdea6/llvm/lib/Target/AArch64/AArch64SchedTSV110.td#L501. It tends to assume a L1 cache latency, as setting it much high tends to push the scheduler into making worse decisions.
It may be better to be a little aggressive after post-ra. I don't know the cpu very well.

nhaehnle resigned from this revision.Jul 20 2022, 6:15 AM
dewen added a subscriber: dewen.Dec 21 2022, 12:03 AM
Matt added a subscriber: Matt.Dec 27 2022, 8:15 AM
Allen added a comment.Jan 29 2023, 6:41 PM

I find it is A57UnitL Resource limited in the A57 backend.
So, after I temporarily tried to adjust the A57UnitL number to 2, the load instruction would be issue first.
In fact, the current architecture is not VLIW, and only one instruction can be sent in a cycle, so will the A57UnitL really have a resource bottleneck in the pipeline?

  • def A57UnitL : ProcResource<2>; // Type L micro-ops
** ScheduleDAGMI::schedule picking next node
Queue TopQ.P: 
Queue TopQ.A: 0 1 5 
  TopQ.A RemainingLatency 0 + 0c > CritPath 17
  Cand SU(0) ORDER                              
Pick Top TOP-PATH  
Scheduling SU(0) renamable $q1, renamable $q2 = LDPQi renamable $x9, -1 :: (load (s128) from %ir.scevgep10, align 8), (load (s128) from %ir.lsr.iv79, align 8)
  Ready @0c
  A57UnitL +2x6u
  *** Critical resource A57UnitL: 2c
  TopQ.A BotLatency SU(0) 17c
  *** Max MOps 3 at cycle 0
Cycle: 1 TopQ.A
TopQ.A @1c
  Retired: 3
  Executed: 2c
  Critical: 2c, 2 A57UnitL
  ExpectedLatency: 0c
  - Resource limited.
** ScheduleDAGMI::schedule picking next node
Queue TopQ.P: 
Queue TopQ.A: 5 1 7 
  TopQ.A RemainingLatency 0 + 1c > CritPath 17
  TopQ.A ResourceLimited: A57UnitL
  Cand SU(5) ORDER                              
Pick Top RES-REDUCE
Scheduling SU(5) renamable $x2 = SUBSXri renamable $x2, 4, 0, implicit-def $nzcv
  Ready @1c
  A57UnitI +1x3u
TopQ.A @1c
  Retired: 4
  Executed: 2c
  Critical: 2c, 2 A57UnitL
  ExpectedLatency: 0c
  - Resource limited.
** ScheduleDAGMI::schedule picking next node
Allen abandoned this revision.Apr 2 2023, 10:03 PM

Abandon as I think the impovement of target schedule model may be a better way as comment suggested