This is an archive of the discontinued LLVM Phabricator instance.

[AARCH64] Improve accumulator forwarding for Cortex-A57 model
ClosedPublic

Authored by mnadeem on Nov 29 2020, 10:51 PM.

Details

Summary

The old CPU model only had MLA->MLA forwarding. I added some missing
MUL->MLA read advances and a missing absolute diff accumulator read
advance according to the Cortex A57 Software Optimization Guide.

The patch improves performance in EEMBC rgbyiqv2 by about 6%-7% and
spec2006/milc by 8% (repeated runs on multiple devices), causes no
significant regressions (none in SPEC).

Diff Detail

Event Timeline

mnadeem created this revision.Nov 29 2020, 10:51 PM
mnadeem requested review of this revision.Nov 29 2020, 10:51 PM
mnadeem edited the summary of this revision. (Show Details)Dec 1 2020, 1:49 PM
dmgreen added a subscriber: dmgreen.

Is it possible to add an llvm-mca test for the instructions that change here? Preferably showing all the instructions that are touched. If not an mir level test might be simple way to create some tests along the same lines (although there's nothing very wrong with the ll test you have, an mir test can be more targeted).

llvm/lib/Target/AArch64/AArch64SchedA57.td
96

Can you update this comment to why this is different, not why it has _changed_ (which doesn't mean a lot once the code is in-tree.)

So something like "Use a WriteRes as opposed to SchedAlias for advance lookup"

420

What is a ^[SU]SHLv1i8 ?
Same for all these others.

llvm/lib/Target/AArch64/AArch64SchedA57WriteRes.td
53

Where is this used?

mnadeem updated this revision to Diff 310603.Dec 9 2020, 12:08 PM
mnadeem edited the summary of this revision. (Show Details)

Replaced the .ll test with llvm-mca test, modified some comments and the summary.
Removed shift->shift-accumulate forwarding because the manual was not clear if its allowed.

mnadeem marked an inline comment as done.Dec 9 2020, 12:15 PM
mnadeem added inline comments.
llvm/lib/Target/AArch64/AArch64SchedA57.td
96

Removed the comment because the warning on line 68 explains the change.
I felt that the comment was redundant.

420

SSHL and USHL D-form.

https://godbolt.org/z/a7jo7a
The schedule was missing. These should only use the X Unit but seem to be using Units X/W

llvm/lib/Target/AArch64/AArch64SchedA57WriteRes.td
53

Removed it. That was a mistake.

dmgreen added inline comments.Dec 15 2020, 3:52 AM
llvm/lib/Target/AArch64/AArch64SchedA57.td
379–383

Do PMUL and sqdmulh have this forwarding? Same for other instructions like SQDMLAL below.

420

This can drop the v1iX (except i64). These are all the valid SSHL instructions, the bottom ones being the important ones.

SSHLLB_ZZI_D        = 4444,
SSHLLB_ZZI_H        = 4445,
SSHLLB_ZZI_S        = 4446,
SSHLLT_ZZI_D        = 4447,
SSHLLT_ZZI_H        = 4448,
SSHLLT_ZZI_S        = 4449,
SSHLLv16i8_shift    = 4450,
SSHLLv2i32_shift    = 4451,
SSHLLv4i16_shift    = 4452,
SSHLLv4i32_shift    = 4453,
SSHLLv8i16_shift    = 4454,
SSHLLv8i8_shift     = 4455,
SSHLv16i8   = 4456,        
SSHLv1i64   = 4457,        
SSHLv2i32   = 4458,        
SSHLv2i64   = 4459,        
SSHLv4i16   = 4460,        
SSHLv4i32   = 4461,        
SSHLv8i16   = 4462,        
SSHLv8i8    = 4463,
dmgreen added inline comments.Dec 15 2020, 3:56 AM
llvm/lib/Target/AArch64/AArch64SchedA57.td
416

This may be worth splitting off the non-forwarding parts into it's own commit to keep the unrelated changes separate.

evgeny777 added inline comments.Dec 21 2020, 4:35 AM
llvm/lib/Target/AArch64/AArch64SchedA57.td
379–383

@mnadeem It looks like they don't (at least PMUL). I've done some experiments with llvm-exegesis with following results:

  1. Latency of PMUL is 4 cycles, not 5 cycles
  2. There is always 4 cyc latency for PMUL result forwarded to MLA/MLS accumulator

I've used Jetson Nano board for testing

mnadeem marked an inline comment as done.Dec 26 2020, 10:09 AM

@evgeny777 Regarding the PMUL latency the optimization guide says this. AArch64SchedA57.td probably has older latencies. Should these be updated?

Cortex-A57 r1p0 and later reduce the latency of ASIMD multiply and multiply-with-accumulate instructions relative to r0pX.

@dmgreen You are correct the only MUL forwarding is from FP MUL (FMUL/FNMUL) and ASIMD FP MUL (FMUL/FMULX). Let me correct that and divide the patch into two.

I'll have to see if there is any negative performance impact first.

mnadeem edited the summary of this revision. (Show Details)Dec 27 2020, 11:06 AM
mnadeem updated this revision to Diff 313793.Dec 27 2020, 11:14 AM
  1. Removed changes that were not related to forwarding.
  2. Removed forward from ASMID multiplies (mul, pmul, sqdmulh, smull, umull).
  3. Updated the test case accordingly.
  1. Removed forward from ASMID multiplies (mul, pmul, sqdmulh, smull, umull).

As far as I can tell from some experiments, mul and smull/umull can forward into an accumulator. The others (pmul and sqdmul) do not. The tests for them would be good to keep too.

Other than that, this looks good to me.

evgeny777 added a comment.EditedDec 30 2020, 9:48 AM

Regarding the PMUL latency the optimization guide says this. AArch64SchedA57.td probably has older latencies. Should these be updated?

Yes, but this deserves separate patch, I think

mnadeem updated this revision to Diff 314139.Dec 30 2020, 12:38 PM
mnadeem added a reviewer: dmgreen.

Added mul, smull, umull forwards again.
Added the relevant tests again and kept the non-forwarding tests for pmul, sqdmulh, sqrdmulh, sqdmull, pmull.

mnadeem marked 4 inline comments as done.Dec 30 2020, 12:42 PM
mnadeem added inline comments.
llvm/lib/Target/AArch64/AArch64SchedA57.td
378

Divided these into two because pmul etc dont forward.

dmgreen accepted this revision.Dec 30 2020, 1:15 PM

Thanks. LGTM

I've not ran any benchmarking, and this is touching some fairly commonly used parts of a commonly used schedule (and scheduling can be weird at times). If we do run into problems we can address them as we need, but it looks like a nice improvement and the numbers you have look good. Thanks for the patch.

This revision is now accepted and ready to land.Dec 30 2020, 1:15 PM

Thanks for accepting the revision.

I don't have commit access.

This revision was automatically updated to reflect the committed changes.
thakis added a subscriber: thakis.Jan 4 2021, 3:16 AM

Looks like this breaks tests on arm macs: http://45.33.8.238/macm1/1057/step_10.txt

Please take a look, and revert for now if it takes a while to fix.

Oh strange. Let me try with a more specific triple..

OK that seems to have done better. Thanks for the report. Let us know if anything else comes up.

thakis added a comment.Jan 4 2021, 5:26 AM

Looks good, thanks for the quick fix :)