Based on the discussion, it is preferable to hoist loads as early as possible after register allocation.
https://discourse.llvm.org/t/insn-schedule-is-it-reasonable-to-issue-the-load-instruction-preferentially/63674
Do you have any numbers on the benefit of that? I am not sure if using a target feature here is a good idea. If it is better to schedule aggressively for latency then this should probably be driven by the scheduling model.
hi @fhahn, I see benefits on the tsv110 target (Kunpeng 920), but I'm not sure whether this feature would have a negative performance impact on other architectures, so I added the target feature. Does that seem reasonable?
Sorry for the delay. My immediate thought was: why not just increase the latency on the load? But I guess this is different. Most CPUs can only issue a certain number of loads per cycle, so "load;load;load;load;load;add;add;add;add" will be worse than emitting "load;add;load;add;load;add;load;add". It will depend on the latencies of the loads, though, and on any dependencies; most of the time we encode an optimistic latency into the schedule latencies for loads.
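(To make "optimistic latency" concrete, here is a minimal TableGen sketch in the style of the AArch64 scheduling models; the def name and the 4-cycle figure are illustrative, assuming an L1 hit, and are not quoted from the tree:)

def A57Write_4cyc_1L : SchedWriteRes<[A57UnitL]> {
  let Latency = 4;  // optimistic L1-hit load-to-use latency; a miss
                    // is far longer, but encoding a pessimistic
                    // value would over-serialize the scheduling DAG
}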
Most CPUs nowadays will execute out of order around that anyway, just filling up internal queues. It depends on how much they fetch per cycle and what acts as the bottleneck. So I wanted to run some benchmarks to see if this modified anything, but they didn't show much change.
Umm. Is this for a downstream CPU, or is it using tsv110? Does the scheduling model have the correct information for the loads in question? I worry that some of the code we are adding looks like dead code that someone could rightly delete as unused. Should it be added to the tuning features for the tsv110 CPU?
Thank you for your suggestion; it is true that unconditionally issuing the load instructions preferentially may be too aggressive.
Can you give me some guidance on the code that encodes the optimistic latency into the schedule latencies for loads? I can debug it to see whether a new solution can be used for improvement.
The scheduling info will come from https://github.com/llvm/llvm-project/blob/439668871ac992159f00309d3bd837db287bdea6/llvm/lib/Target/AArch64/AArch64SchedTSV110.td#L501. It tends to assume an L1 cache latency, as setting it much higher tends to push the scheduler into making worse decisions.
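(As a hypothetical experiment along those lines — not the actual contents of that file, and the TSV110UnitLdSt resource name is written from memory — one could raise the assumed load-to-use latency so the scheduler hoists loads further from their users:)

def : WriteRes<WriteLD, [TSV110UnitLdSt]> {
  let Latency = 8;  // illustrative: pretend loads take 8 cycles
                    // instead of the optimistic L1-hit value
}

The trade-off is the one described above: a pessimistic latency pulls every load early, which can increase pressure elsewhere and push the scheduler into worse decisions overall.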
It may be better to be a little aggressive post-RA. I don't know the CPU very well.
I found that it is A57UnitL that is resource limited in the A57 backend.
So, after I temporarily adjusted the A57UnitL count to 2, the load instruction was issued first.
In fact, the current architecture is not VLIW, and only one instruction can be issued per cycle, so will A57UnitL really be a resource bottleneck in the pipeline?
- def A57UnitL : ProcResource<2>; // Type L micro-ops
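(For comparison, a sketch of how the unit count interacts with per-instruction micro-ops — names and numbers are illustrative, not quoted from the A57 model. A load-pair such as the LDPQi in the trace below consumes two L micro-ops, so with ProcResource<N> only N L micro-ops can start per cycle before A57UnitL becomes the critical resource:)

def A57Write_5cyc_2L : SchedWriteRes<[A57UnitL]> {
  let Latency     = 5;  // illustrative load-pair latency
  let NumMicroOps = 2;  // two L micro-ops per load-pair
}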
** ScheduleDAGMI::schedule picking next node
Queue TopQ.P:
Queue TopQ.A: 0 1 5
  TopQ.A RemainingLatency 0 + 0c > CritPath 17
  Cand SU(0) ORDER
Pick Top TOP-PATH
Scheduling SU(0) renamable $q1, renamable $q2 = LDPQi renamable $x9, -1 :: (load (s128) from %ir.scevgep10, align 8), (load (s128) from %ir.lsr.iv79, align 8)
  Ready @0c
  A57UnitL +2x6u
  *** Critical resource A57UnitL: 2c
  TopQ.A BotLatency SU(0) 17c
  *** Max MOps 3 at cycle 0
Cycle: 1 TopQ.A
TopQ.A @1c
  Retired: 3
  Executed: 2c
  Critical: 2c, 2 A57UnitL
  ExpectedLatency: 0c
  - Resource limited.
** ScheduleDAGMI::schedule picking next node
Queue TopQ.P:
Queue TopQ.A: 5 1 7
  TopQ.A RemainingLatency 0 + 1c > CritPath 17
  TopQ.A ResourceLimited: A57UnitL
  Cand SU(5) ORDER
Pick Top RES-REDUCE
Scheduling SU(5) renamable $x2 = SUBSXri renamable $x2, 4, 0, implicit-def $nzcv
  Ready @1c
  A57UnitI +1x3u
TopQ.A @1c
  Retired: 4
  Executed: 2c
  Critical: 2c, 2 A57UnitL
  ExpectedLatency: 0c
  - Resource limited.
** ScheduleDAGMI::schedule picking next node
Abandoning this, as I think improving the target scheduling model may be a better way, as the comments suggested.