This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Refines the Cortex-A57 Machine Model
ClosedPublic

Authored by cestes on Sep 16 2014, 1:53 PM.

Details

Summary

The largest refinement is to model the Cortex-A57 as an in-order
machine to help ensure that the maximum number of micro-ops is
issued into the out-of-order stages on every cycle. Modeling it
as a 3-wide in-order machine helps ensure that each of the 3
micro-ops that are decoded and dispatched per cycle can be
issued immediately.

Secondly, a few advanced features are modeled, including forwarding
for MAC instructions and hazards for floating point SQRT and DIV.

Lastly, all of the instructions with inaccurate latency or
micro-op information are refined to be as accurate as possible.
These refinements are largely for the NEON instructions.

Diff Detail

Event Timeline

cestes updated this revision to Diff 13763.Sep 16 2014, 1:53 PM
cestes retitled this revision from to [AArch64] Refines the Cortex-A57 Machine Model.
cestes updated this object.
cestes edited the test plan for this revision. (Show Details)

FWIW, Dave and I discussed this offline. IMO, this is actually three separate patches and should be committed as such. However, I told him to go ahead and post a single patch to simplify the review process. I'll defer to Andy/others to decide if they'd like to review separate patches, however.

Chad

atrick accepted this revision.Sep 16 2014, 4:37 PM
atrick edited edge metadata.

Setting IssueWidth=3 is correct. That really means how many micro-ops can be "handled" per cycle. So it should be the minimum of decode/issue width. To be precise, we should have a decodeWidth that counts instructions, but I never bothered to add it since IssueWidth can serve the same purpose.
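For readers unfamiliar with the scheduling TableGen, the relationship Andy describes might look roughly like this (a sketch with invented names and illustrative values, not the actual AArch64SchedA57.td contents):

```tablegen
// Sketch only: IssueWidth in a SchedMachineModel caps the number of
// micro-ops the scheduler assumes can be "handled" per cycle, so it is
// set to the minimum of the decode and issue widths.
def IllustrativeA57Model : SchedMachineModel {
  let IssueWidth = 3;          // min(3-wide decode, dispatch width)
  let MicroOpBufferSize = 0;   // 0 => model latency as in-order
  let LoadLatency = 4;         // illustrative, not a measured value
}
```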

I don't think MinLatency is being used anymore by the generic machine scheduler. I made a note to rip that out.

MicroOpBufferSize determines in-order modeling of latency. It's your machine, so if you want to model it as in-order and get better results, then I can't argue!

You could go even further and model the in-order stalls on functional units that are not fully pipelined by setting BufferSize=0.
Note that you can have a mix of in-order/out-of-order resources if you choose.
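In TableGen terms, such a mix might be sketched as follows (names are invented for illustration, not the patch's actual definitions):

```tablegen
// Sketch only: BufferSize = 0 on a processor resource makes contention
// for it a hard in-order stall, while the default (-1, inheriting the
// model's MicroOpBufferSize) keeps a resource out-of-order.
let BufferSize = 0 in
def IllustrativeUnitDiv : ProcResource<1>;  // non-pipelined divide/sqrt unit

def IllustrativeUnitV : ProcResource<2>;    // buffered (out-of-order) SIMD pipes
```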

You can also model just a certain class of instructions as having in-order latency by boosting MicroOpBufferSize and setting BufferSize=1. You can have a class of instructions consume multiple resources so you could model both in-order resource contention and latency.
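A hypothetical write type consuming two resources could look like this (again, names are invented for illustration):

```tablegen
// Sketch only: a SchedWriteRes may consume several ProcResources, so a
// single write type can model in-order contention (via the
// BufferSize = 1 resource) and latency at the same time.
let BufferSize = 1 in
def IllustrativeUnitX : ProcResource<1>;     // in-order issue resource

def IllustrativeUnitV2 : ProcResource<2>;    // buffered SIMD pipes

def IllustrativeWrite_5cyc : SchedWriteRes<[IllustrativeUnitV2,
                                            IllustrativeUnitX]> {
  let Latency = 5;
  let NumMicroOps = 1;
}
```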

Note that the idea behind modeling out-of-order is that we don't want an instruction issue limitation to be modeled as a hard stall that preempts all other heuristics. There are thresholds and heuristics that then come into play to try to balance resources. However, the default heuristics are very conservative, in the sense that the schedule is preserved unless we suspect a real stall (first do no harm). Given the scheduler only sees a single block, it often doesn't do anything to improve issue bandwidth on an aggressive OOO model. The scheduler could be improved by recognizing loops, inferring a steady cpu state and adjusting heuristics. I've added some loop awareness to the heuristics but it could be much better.

Since you have plenty of registers, scheduling in-order probably doesn't often hurt and is occasionally useful depending on how effective the hardware is at balancing instruction dispatch. You'll probably see a lot of unnecessary shuffling with in-order scheduling, but if you get better performance, then it's worth it.

One thing you will notice is that interdependent instructions will no longer be scheduled in the same 3-wide decoding group. Since we're not inserting nops, it's probably not a big deal though.

This revision is now accepted and ready to land.Sep 16 2014, 4:37 PM
Jiangning added inline comments.Sep 16 2014, 11:10 PM
lib/Target/AArch64/AArch64SchedA57.td
523

Where are FN?MUL[DS]rr ?

Jiangning edited edge metadata.Sep 17 2014, 1:21 AM

Hi Dave,

I tried your patch on ToT, and got the following result. (negative number is good).

spec.cpu2000.ref.175_vpr -1.10%
spec.cpu2000.ref.177_mesa -2.46%
spec.cpu2000.ref.179_art 1.96%
spec.cpu2000.ref.183_equake 4.30%
spec.cpu2000.ref.252_eon 2.06%
spec.cpu2000.ref.254_gap 1.59%
spec.cpu2000.ref.256_bzip2 1.49%
spec.cpu2000.ref.300_twolf 3.71%

Somehow we see regressions for spec2000.

Thanks,
-Jiangning

cestes edited edge metadata.Sep 17 2014, 8:17 AM
cestes added a subscriber: Unknown Object (MLST).

I'm seeing strong improvements for Spec2000 on device here, so I'll try ToT too and get to the bottom of this.

Thanks.

I tried your patch on ToT, and got the following result. (negative number is good).

spec.cpu2000.ref.175_vpr -1.10%
spec.cpu2000.ref.177_mesa -2.46%
spec.cpu2000.ref.179_art 1.96%
spec.cpu2000.ref.183_equake 4.30%
spec.cpu2000.ref.252_eon 2.06%
spec.cpu2000.ref.254_gap 1.59%
spec.cpu2000.ref.256_bzip2 1.49%
spec.cpu2000.ref.300_twolf 3.71%

lib/Target/AArch64/AArch64SchedA57.td
523

Thanks for the feedback, Jiangning.

In this case, FN?MUL[DS]rr instructions don't have a specific InstRW, because their default WriteFMul has been mapped to the correct specific SchedWrite already, A57Write_5cyc_1V. I only use InstRWs to refine instructions that aren't correct with the default mappings.
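The two mechanisms Dave contrasts can be sketched like so (illustrative TableGen under assumed names, not the literal A57 file contents):

```tablegen
// Sketch only: a default SchedWrite mapping covers whole classes of
// instructions, so FN?MUL[DS]rr need no InstRW of their own.
def IllustrativeUnitFP : ProcResource<2>;   // invented FP/SIMD resource

def IllustrativeWrite_5cyc_1V : SchedWriteRes<[IllustrativeUnitFP]> {
  let Latency = 5;
}

// Every instruction tagged with WriteFMul picks this up automatically.
def : SchedAlias<WriteFMul, IllustrativeWrite_5cyc_1V>;

// InstRW is reserved for instructions whose defaults are wrong
// (the regex here is invented for illustration).
def : InstRW<[IllustrativeWrite_5cyc_1V], (instregex "^FMULv")>;
```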

Setting IssueWidth=3 is correct. That really means how many micro-ops can be "handled" per cycle. So it should be the minimum of decode/issue width. To be precise, we should have a decodeWidth that counts instructions, but I never bothered to add it since IssueWidth can serve the same purpose.

Thanks for the clarification.

MicroOpBufferSize determines in-order modeling of latency. It's your machine, so if you want to model it as in-order and get better results, then I can't argue!

You could go even further and model the in-order stalls on functional units that are not fully pipelined by setting BufferSize=0.
Note that you can have a mix of in-order/out-of-order resources if you choose.

I figured there were some tradeoffs with modeling purely in-order, but the gains were so broadly beneficial that it was a no-brainer. I really want to do just this and model both the in-order and out-of-order portions of the pipeline for each instruction. It wasn't immediately obvious how to do it, so I temporarily shelved the idea. Might be a nice experiment for a proposed SchedMachineModel tutorial. :)

You can also model just a certain class of instructions as having in-order latency by boosting MicroOpBufferSize and setting BufferSize=1. You can have a class of instructions consume multiple resources so you could model both in-order resource contention and latency.

Note that the idea behind modeling out-of-order is that we don't want an instruction issue limitation to be modeled as a hard stall that preempts all other heuristics. There are thresholds and heuristics that then come into play to try to balance resources. However, the default heuristics are very conservative, in the sense that the schedule is preserved unless we suspect a real stall (first do no harm). Given the scheduler only sees a single block, it often doesn't do anything to improve issue bandwidth on an aggressive OOO model. The scheduler could be improved by recognizing loops, inferring a steady cpu state and adjusting heuristics. I've added some loop awareness to the heuristics but it could be much better.

I really like this idea of adjusting heuristics. Think this is something that PGO can also help with?

Since you have plenty of registers, scheduling in-order probably doesn't often hurt and is occasionally useful depending on how effective the hardware is at balancing instruction dispatch. You'll probably see a lot of unnecessary shuffling with in-order scheduling, but if you get better performance, then it's worth it.

One thing you will notice is that interdependent instructions will no longer be scheduled in the same 3-wide decoding group. Since we're not inserting nops, it's probably not a big deal though.

Thanks again for all of the clarification, Andy.

Jiangning,

I did some more runs and I've got mixed news. It seems I've been a
bit more focused on this new model's gains over -mcpu=generic than
on using the existing A57 model as a baseline, primarily because
our earlier testing showed the existing A57 model performing very
poorly. However, I re-did my runs using the existing A57 model as a
baseline, and it actually performs really well. So that's the good
news. The mediocre news is that increasing the accuracy of the
model has merely shifted performance around rather than actually
increasing it.

With that said, I'm going to do some more experimenting and then I'm
going to try to model the in-order and out-of-order resources
accordingly in a hope that I can capture the best gains. It might take a
bit of time, but I'll hopefully replace this patch with those efforts.

In the meantime, I might have some questions for you guys but I'll take
that chatter off-list.

-Dave

jmolloy edited edge metadata.Sep 18 2014, 9:37 AM

Hi Dave,

I’ve discovered that we should be running the FPLoadBalancing pass AFTER
the Post-RA scheduler. We aren’t, and I thought we were.

The FPLoadBalancing pass is sensitive to instruction order - a permutation
such as might be expected if the post-RA scheduler does its job could
cause worse performance.

I’ve looked into switching it to later but it exposed a couple of bugs, so
I’m working on fixing those first.

Cheers,
James

cestes updated this revision to Diff 14018.Sep 23 2014, 2:01 PM
cestes edited edge metadata.

Update changes from 3-way issue in-order to 3-way issue out-of-order.

All,

This new patchset moves the model back to out-of-order yet restricts the issue-width to the minimum of the actual issue width and dispatch width as Andy suggested. It brought the Spec2000/2006 numbers back up and even outperformed the original model by a few percent (geomean). It also improved the EEMBC numbers by a percent (geomean). I did see some degradation in individual tests, but nothing horrible. It will take some more detailed analysis to determine the cause there.

Jiangning,

In the meantime, if you can replicate the performance gain, then I'd like to move forward with this review, because the more accurate latency information will be key to future analysis and refinements.

Thanks...
-Dave

Dave,

I'm running the benchmarks and will let you know the results as soon as I get them.

Thanks,
-Jiangning

2014-09-24 22:47 GMT+08:00 Dave Estes <cestes@codeaurora.org>:

All,

This new patchset moves the model back to out-of-order yet restricts the
issue-width to the minimum of the actual issue width and dispatch width as
Andy suggested. It brought the Spec2000/2006 numbers back up and even
outperformed the original model by a few percent (geomean). It also
improved the EEMBC numbers by a percent (geomean). I did see some
degradation in individual tests, but nothing horrible. It will take some
more detailed analysis to determine the cause there.

Jiangning,

In the meantime, if you can replicate the performance gain, then I'd like
to move forward with this review, because the more accurate latency
information will be key to future analysis and refinements.

Thanks...
-Dave

http://reviews.llvm.org/D5372

Hi Dave,

The new version shows good potential, I think.

spec.cpu2000.ref.300_twolf -4.44%
spec.cpu2000.ref.175_vpr -2.58%
spec.cpu2000.ref.255_vortex -1.39%
spec.cpu2000.ref.254_gap 1.40%
spec.cpu2000.ref.183_equake 2.88%

Thanks,
-Jiangning

Excellent. Thanks, Jiangning. Your new numbers show regressions in gap and equake. I'll try to get an equake number, but I do know that we're seeing a ~1% gain. Interestingly enough, one of the regressions that we're seeing is on twolf, but your device shows a gain. :) Despite the differences, I too think this latest patch looks like a good foundation for future work.

If I can get a fresh LGTM, I'll get it committed.

cestes closed this revision.Sep 29 2014, 2:40 PM

Committed as r218627.