Page MenuHomePhabricator

[AArch64] Enable FeatureFuseAES for the generic processor model.
ClosedPublic

Authored by fhahn on Jun 2 2017, 8:38 AM.

Details

Summary

Scheduling AESE/AESMC and AESD/AESIMC instruction pairs back-to-back
gives a double digit speedup on benchmarks using those instructions on
Cortex-A processors. In GCC, this optimization is part of the generic
processor model as well.

This change should not have a major performance impact on processors
that do not optimize AES instruction pairs, although I only had access
to Cortex-A processors for benchmarking.

Diff Detail

Event Timeline

fhahn created this revision.Jun 2 2017, 8:38 AM
rengolin edited edge metadata.Jun 2 2017, 8:53 AM

Makes sense to me. But now that "generic" is the default, it'll impact all cores equally, so I'll let other people comment before approving.

evandro edited edge metadata.Jun 2 2017, 12:21 PM

Since it follows the precedent in GCC, I am fine with it. I suspect that targets that do not support this, whether in order or out of order, would perform the same. Can you share any data, @fhahn?

fhahn added a comment.Jun 2 2017, 1:44 PM

The performance impact on cores without that optimization should be minimal, but unfortunately I only have access to Cortex-A cores for benchmarking, so I cannot provide any numbers for other cores. On early Cortex-A cores the impact was not noticeable though.

That's fine, @fhahn. Perhaps some figures on A53 and A72?

javed.absar added inline comments.Jun 2 2017, 2:49 PM
test/CodeGen/AArch64/misched-fusion-aes.ll
6

If the cpu is generic, why is the check 'CHECKCORTEX' ?
Maybe, I am missing something.

joelkevinjones edited edge metadata.Jun 2 2017, 2:57 PM

I can probably scare up whether this impacts ThunderX processors. What benchmark should I look at? OpenSSL?

rengolin added inline comments.Jun 2 2017, 3:47 PM
test/CodeGen/AArch64/misched-fusion-aes.ll
6

I think it's just repeating the same pattern. Probably good to change to something more meaningful, like "CHECKFUSEAES" or something.

fhahn updated this revision to Diff 101408.Jun 5 2017, 8:18 AM
fhahn marked 2 inline comments as done.

Update CHECK- line.

@evandro I’m sorry, I cannot share the exact details for various reasons, but it was over 40% on a proprietary benchmark.

@joelkevinjones unfortunately I cannot share the benchmark we used either, and I’m not aware of a publicly-available one.

@evandro I’m sorry, I cannot share the exact details for various reasons, but it was over 40% on a proprietary benchmark.

I understand that, but perhaps you could have a simple loop with AESE/AESMC and another with AESD/AESIMC and report their throughput on A53 and A57 before and after the patch.

As a matter of fact, I encourage you to consider adding such a test to LNT.

fhahn added a comment.Jun 5 2017, 8:45 AM

I understand that the case for enabling FeatureFuseAES is quite weak at the moment! I'll see if I can find any public benchmarks, otherwise I'll try and add a handwritten benchmark to LNT.

evandro added inline comments.Jun 5 2017, 8:48 AM
test/CodeGen/AArch64/misched-fusion-aes.ll
1–6

It should really be CHECKGENERIC. CHECKFUSEAES would make more sense if you add a test with -mattr=fuse-aes.

2

I don't think that you should change CHECKCORTEX.

fhahn added a comment.Jun 6 2017, 8:10 AM

I had another look and it turns out I can’t add a microbenchmark for legal reasons. Unfortunately the only thing I can say is that back-to-back scheduling brings double digit performance improvements for code making heavy use of AES instructions on Cortex-A CPUs and refer to the public software optimization guides for Cortex-A72 and Cortex-A57, which both encourage this optimization.

javed.absar edited edge metadata.Jun 6 2017, 8:55 AM

My two cents. I think this is likely to give gains where pipeline leverages it, and unlikely to cause regressions otherwise.
Joel Jones mentioned checking performance on ThuderX so we should probably wait for him to come back on it.

We already have the IR in misched-fusion-aes.ll that can be used for microbenchmarks (just call the functions in a loop)? I think that should be theoretically enough for an evaluation.

Cheers,
Silviu

I'm okay with this change

fhahn added a comment.Jun 13 2017, 6:07 AM

ping. Did the last comments adequately address the concerns raised last week?

LGTM. @sbaranga are you also ok with this?

This generally makes sense to me!

However, since we are making this the default we should also hear some more opinions. @MatzeB , @mcrosier : any thoughts on this?

evandro added inline comments.Jun 13 2017, 8:17 AM
test/CodeGen/AArch64/misched-fusion-aes.ll
1–6

Methinks that there should also be a test solely with -mattr=fuse-aes and no -mcpu=....

MatzeB edited edge metadata.Jun 13 2017, 8:31 AM

This generally makes sense to me!

However, since we are making this the default we should also hear some more opinions. @MatzeB , @mcrosier : any thoughts on this?

At a higher level this makes sense to me. Apple targets default to cyclone so they should not be affected.

fhahn updated this revision to Diff 102355.Jun 13 2017, 9:37 AM
fhahn marked an inline comment as done.

Updated test case

evandro accepted this revision.Jun 13 2017, 11:27 AM
This revision is now accepted and ready to land.Jun 13 2017, 11:27 AM
fhahn added a comment.Jun 14 2017, 1:37 AM

Thanks for all the comments! Unless there are any objections raised, I'll commit the change tomorrow.

fhahn closed this revision.Jun 15 2017, 2:31 AM