This is an archive of the discontinued LLVM Phabricator instance.


Differential D33836

[AArch64] Enable FeatureFuseAES for the generic processor model.
ClosedPublic

Authored by fhahn on Jun 2 2017, 8:38 AM.


Details

Reviewers

rengolin
kristof.beyls
javed.absar
evandro
silviu.baranga
MatzeB
mcrosier
joelkevinjones
joel_k_jones
bmakam
t.p.northover

Commits

rG0a26d2c298a6: [AArch64] Enable FeatureFuseAES for the generic processor model.
rL305457: [AArch64] Enable FeatureFuseAES for the generic processor model.

Summary

Scheduling AESE/AESMC and AESD/AESIMC instruction pairs back-to-back
gives a double-digit speedup on benchmarks using those instructions on
Cortex-A processors. In GCC, this optimization is part of the generic
processor model as well.

This change should not have a major performance impact on processors
that do not optimize AES instruction pairs, although I only had access
to Cortex-A processors for benchmarking.
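
To make the pairing concrete, here is a minimal C++ sketch of the kind of predicate a scheduler could use to decide whether two adjacent instructions form a fusible AES pair. It is a simplified model and not LLVM's implementation: the `Opcode` enum, the `Subtarget` struct, and `shouldFuseAES` are hypothetical names chosen for illustration. Only the AESE/AESMC and AESD/AESIMC pairings and the idea of a per-target feature switch come from the summary above.

```cpp
#include <cassert>

// Hypothetical opcode set for illustration; not LLVM's real opcode enum.
enum class Opcode { AESE, AESMC, AESD, AESIMC, Other };

// Hypothetical stand-in for a subtarget; hasFuseAES models FeatureFuseAES.
struct Subtarget {
  bool hasFuseAES;
};

// Returns true if `first` immediately followed by `second` is one of the two
// AES pairs named in the summary and the subtarget opts in to fusing them.
bool shouldFuseAES(const Subtarget &st, Opcode first, Opcode second) {
  if (!st.hasFuseAES)
    return false;
  return (first == Opcode::AESE && second == Opcode::AESMC) ||
         (first == Opcode::AESD && second == Opcode::AESIMC);
}

// Small self-check showing the intended behaviour.
void demoShouldFuseAES() {
  const Subtarget withFusion{true};
  const Subtarget withoutFusion{false};
  assert(shouldFuseAES(withFusion, Opcode::AESE, Opcode::AESMC));
  assert(shouldFuseAES(withFusion, Opcode::AESD, Opcode::AESIMC));
  assert(!shouldFuseAES(withFusion, Opcode::AESE, Opcode::AESIMC));
  assert(!shouldFuseAES(withoutFusion, Opcode::AESE, Opcode::AESMC));
}
```

In this model, adding the feature to the generic processor corresponds to constructing the subtarget with `hasFuseAES` set to true, so the scheduler keeps matching pairs adjacent on any target that does not select a more specific CPU.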

Diff Detail

Event Timeline

fhahn created this revision. Jun 2 2017, 8:38 AM

Herald added a subscriber: aemerson. Jun 2 2017, 8:38 AM

Makes sense to me. But now that "generic" is the default, it'll impact all cores equally, so I'll let other people comment before approving.

Since it follows the precedent in GCC, I am fine with it. I suspect that targets that do not support this, whether in order or out of order, would perform the same. Can you share any data, @fhahn?

The performance impact on cores without that optimization should be minimal, but unfortunately I only have access to Cortex-A cores for benchmarking, so I cannot provide any numbers for other cores. On early Cortex-A cores the impact was not noticeable though.

That's fine, @fhahn. Perhaps some figures on A53 and A72?

javed.absar added inline comments. Jun 2 2017, 2:49 PM

test/CodeGen/AArch64/misched-fusion-aes.ll
6	If the cpu is generic, why is the check 'CHECKCORTEX' ? Maybe, I am missing something.

I can probably scare up whether this impacts ThunderX processors. What benchmark should I look at? OpenSSL?

rengolin added inline comments. Jun 2 2017, 3:47 PM

test/CodeGen/AArch64/misched-fusion-aes.ll
6	I think it's just repeating the same pattern. Probably good to change to something more meaningful, like "CHECKFUSEAES" or something.

Update CHECK- line.

@evandro I’m sorry, I cannot share the exact details for various reasons, but it was over 40% on a proprietary benchmark.

@joelkevinjones unfortunately I cannot share the benchmark we used either, and I’m not aware of a publicly-available one.

In D33836#772848, @fhahn wrote:

@evandro I’m sorry, I cannot share the exact details for various reasons, but it was over 40% on a proprietary benchmark.

I understand that, but perhaps you could have a simple loop with AESE/AESMC and another with AESD/AESIMC and report their throughput on A53 and A57 before and after the patch.

As a matter of fact, I encourage you to consider adding such a test to LNT.

I understand that the case for enabling FeatureFuseAES is quite weak at the moment! I'll see if I can find any public benchmarks, otherwise I'll try and add a handwritten benchmark to LNT.

evandro added inline comments. Jun 5 2017, 8:48 AM

test/CodeGen/AArch64/misched-fusion-aes.ll
1–6	It should really be CHECKGENERIC. CHECKFUSEAES would make more sense if you add a test with `-mattr=fuse-aes`.
2	I don't think that you should change CHECKCORTEX.

I had another look, and it turns out I can’t add a microbenchmark for legal reasons. Unfortunately, the only thing I can say is that back-to-back scheduling brings double-digit performance improvements for code making heavy use of AES instructions on Cortex-A CPUs. I can also point to the public software optimization guides for Cortex-A72 and Cortex-A57, which both encourage this optimization.

My two cents: I think this is likely to give gains where the pipeline leverages it, and unlikely to cause regressions otherwise.
Joel Jones mentioned checking performance on ThunderX, so we should probably wait for him to come back on it.

We already have the IR in misched-fusion-aes.ll that can be used for microbenchmarks (just call the functions in a loop)? I think that should be theoretically enough for an evaluation.

Cheers,
Silviu

I'm okay with this change

ping. Did the last comments adequately address the concerns raised last week?

LGTM. @sbaranga are you also ok with this?

This generally makes sense to me!

However, since we are making this the default we should also hear some more opinions. @MatzeB , @mcrosier : any thoughts on this?

evandro added inline comments. Jun 13 2017, 8:17 AM

test/CodeGen/AArch64/misched-fusion-aes.ll
1–6	Methinks that there should also be a test solely with `-mattr=fuse-aes` and no `-mcpu=...`.

In D33836#778967, @sbaranga wrote:

This generally makes sense to me!

However, since we are making this the default we should also hear some more opinions. @MatzeB , @mcrosier : any thoughts on this?

At a higher level this makes sense to me. Apple targets default to cyclone so they should not be affected.

Updated test case

evandro accepted this revision. Jun 13 2017, 11:27 AM

This revision is now accepted and ready to land. Jun 13 2017, 11:27 AM

Thanks for all the comments! Unless there are any objections raised, I'll commit the change tomorrow.

fhahn closed this revision. Jun 15 2017, 2:31 AM

kristof.beyls mentioned this in D35260: [AArch64] Move AES instruction fusion support. Oct 11 2017, 12:27 AM

Revision Contents

Path	Size
lib/Target/AArch64/AArch64.td	1 line
test/CodeGen/AArch64/misched-fusion-aes.ll	77 lines

Diff 102355

lib/Target/AArch64/AArch64.td

(356 lines not shown)	def ProcThunderXT83 : SubtargetFeature<"thunderxt83", "ARMProcFamily",
FeatureFPARMv8,		FeatureFPARMv8,
FeaturePerfMon,		FeaturePerfMon,
FeaturePostRAScheduler,		FeaturePostRAScheduler,
FeaturePredictableSelectIsExpensive,		FeaturePredictableSelectIsExpensive,
FeatureNEON]>;		FeatureNEON]>;

def : ProcessorModel<"generic", NoSchedModel, [		def : ProcessorModel<"generic", NoSchedModel, [
FeatureFPARMv8,		FeatureFPARMv8,
		FeatureFuseAES,
FeatureNEON,		FeatureNEON,
FeaturePerfMon,		FeaturePerfMon,
FeaturePostRAScheduler		FeaturePostRAScheduler
]>;		]>;

// FIXME: Cortex-A35 is currently modeled as a Cortex-A53.		// FIXME: Cortex-A35 is currently modeled as a Cortex-A53.
def : ProcessorModel<"cortex-a35", CortexA53Model, [ProcA35]>;		def : ProcessorModel<"cortex-a35", CortexA53Model, [ProcA35]>;
def : ProcessorModel<"cortex-a53", CortexA53Model, [ProcA53]>;		def : ProcessorModel<"cortex-a53", CortexA53Model, [ProcA53]>;
(62 lines not shown)

test/CodeGen/AArch64/misched-fusion-aes.ll

; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=cortex-a53 | FileCheck %s --check-prefix=CHECK --check-prefix=CHECKCORTEX		; RUN: llc %s -o - -mtriple=aarch64-unknown -mattr=+fuse-aes,+crypto | FileCheck %s --check-prefix=CHECK --check-prefix=CHECKFUSEALLPAIRS
; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=cortex-a57 | FileCheck %s --check-prefix=CHECK --check-prefix=CHECKCORTEX		; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=generic -mattr=+crypto | FileCheck %s --check-prefix=CHECK --check-prefix=CHECKFUSEALLPAIRS
; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=cortex-a72 | FileCheck %s --check-prefix=CHECK --check-prefix=CHECKCORTEX		; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=cortex-a53 | FileCheck %s --check-prefix=CHECK --check-prefix=CHECKFUSEALLPAIRS
; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=cortex-a73 | FileCheck %s --check-prefix=CHECK --check-prefix=CHECKCORTEX		; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=cortex-a57 | FileCheck %s --check-prefix=CHECK --check-prefix=CHECKFUSEALLPAIRS
		; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=cortex-a72 | FileCheck %s --check-prefix=CHECK --check-prefix=CHECKFUSEALLPAIRS
		; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=cortex-a73 | FileCheck %s --check-prefix=CHECK --check-prefix=CHECKFUSEALLPAIRS
; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=exynos-m1 | FileCheck %s --check-prefix=CHECK --check-prefix=CHECKM1		; RUN: llc %s -o - -mtriple=aarch64-unknown -mcpu=exynos-m1 | FileCheck %s --check-prefix=CHECK --check-prefix=CHECKM1

declare <16 x i8> @llvm.aarch64.crypto.aese(<16 x i8> %d, <16 x i8> %k)		declare <16 x i8> @llvm.aarch64.crypto.aese(<16 x i8> %d, <16 x i8> %k)
declare <16 x i8> @llvm.aarch64.crypto.aesmc(<16 x i8> %d)		declare <16 x i8> @llvm.aarch64.crypto.aesmc(<16 x i8> %d)
declare <16 x i8> @llvm.aarch64.crypto.aesd(<16 x i8> %d, <16 x i8> %k)		declare <16 x i8> @llvm.aarch64.crypto.aesd(<16 x i8> %d, <16 x i8> %k)
declare <16 x i8> @llvm.aarch64.crypto.aesimc(<16 x i8> %d)		declare <16 x i8> @llvm.aarch64.crypto.aesimc(<16 x i8> %d)

define void @aesea(<16 x i8>* %a0, <16 x i8>* %b0, <16 x i8>* %c0, <16 x i8> %d, <16 x i8> %e) {		define void @aesea(<16 x i8>* %a0, <16 x i8>* %b0, <16 x i8>* %c0, <16 x i8> %d, <16 x i8> %e) {
(56 lines not shown)	define void @aesea(<16 x i8>* %a0, <16 x i8>* %b0, <16 x i8>* %c0, <16 x i8> %d, <16 x i8> %e) {
store <16 x i8> %h1, <16 x i8>* %c1		store <16 x i8> %h1, <16 x i8>* %c1
%c2 = getelementptr inbounds <16 x i8>, <16 x i8>* %c0, i64 2		%c2 = getelementptr inbounds <16 x i8>, <16 x i8>* %c0, i64 2
store <16 x i8> %h2, <16 x i8>* %c2		store <16 x i8> %h2, <16 x i8>* %c2
%c3 = getelementptr inbounds <16 x i8>, <16 x i8>* %c0, i64 3		%c3 = getelementptr inbounds <16 x i8>, <16 x i8>* %c0, i64 3
store <16 x i8> %h3, <16 x i8>* %c3		store <16 x i8> %h3, <16 x i8>* %c3
ret void		ret void

; CHECK-LABEL: aesea:		; CHECK-LABEL: aesea:
; CHECKCORTEX: aese [[VA:v[0-7].16b]], {{v[0-7].16b}}		; CHECKFUSEALLPAIRS: aese [[VA:v[0-7].16b]], {{v[0-7].16b}}
; CHECKCORTEX-NEXT: aesmc {{v[0-7].16b}}, [[VA]]		; CHECKFUSEALLPAIRS-NEXT: aesmc {{v[0-7].16b}}, [[VA]]
; CHECKCORTEX: aese [[VB:v[0-7].16b]], {{v[0-7].16b}}		; CHECKFUSEALLPAIRS: aese [[VB:v[0-7].16b]], {{v[0-7].16b}}
; CHECKCORTEX-NEXT: aesmc {{v[0-7].16b}}, [[VB]]		; CHECKFUSEALLPAIRS-NEXT: aesmc {{v[0-7].16b}}, [[VB]]
; CHECKCORTEX: aese [[VC:v[0-7].16b]], {{v[0-7].16b}}		; CHECKFUSEALLPAIRS: aese [[VC:v[0-7].16b]], {{v[0-7].16b}}
; CHECKCORTEX-NEXT: aesmc {{v[0-7].16b}}, [[VC]]		; CHECKFUSEALLPAIRS-NEXT: aesmc {{v[0-7].16b}}, [[VC]]
; CHECKCORTEX: aese [[VD:v[0-7].16b]], {{v[0-7].16b}}		; CHECKFUSEALLPAIRS: aese [[VD:v[0-7].16b]], {{v[0-7].16b}}
; CHECKCORTEX-NEXT: aesmc {{v[0-7].16b}}, [[VD]]		; CHECKFUSEALLPAIRS-NEXT: aesmc {{v[0-7].16b}}, [[VD]]
; CHECKCORTEX: aese [[VE:v[0-7].16b]], {{v[0-7].16b}}		; CHECKFUSEALLPAIRS: aese [[VE:v[0-7].16b]], {{v[0-7].16b}}
; CHECKCORTEX-NEXT: aesmc {{v[0-7].16b}}, [[VE]]		; CHECKFUSEALLPAIRS-NEXT: aesmc {{v[0-7].16b}}, [[VE]]
; CHECKCORTEX: aese [[VF:v[0-7].16b]], {{v[0-7].16b}}		; CHECKFUSEALLPAIRS: aese [[VF:v[0-7].16b]], {{v[0-7].16b}}
; CHECKCORTEX-NEXT: aesmc {{v[0-7].16b}}, [[VF]]		; CHECKFUSEALLPAIRS-NEXT: aesmc {{v[0-7].16b}}, [[VF]]
; CHECKCORTEX: aese [[VG:v[0-7].16b]], {{v[0-7].16b}}		; CHECKFUSEALLPAIRS: aese [[VG:v[0-7].16b]], {{v[0-7].16b}}
; CHECKCORTEX-NEXT: aesmc {{v[0-7].16b}}, [[VG]]		; CHECKFUSEALLPAIRS-NEXT: aesmc {{v[0-7].16b}}, [[VG]]
; CHECKCORTEX: aese [[VH:v[0-7].16b]], {{v[0-7].16b}}		; CHECKFUSEALLPAIRS: aese [[VH:v[0-7].16b]], {{v[0-7].16b}}
; CHECKCORTEX-NEXT: aesmc {{v[0-7].16b}}, [[VH]]		; CHECKFUSEALLPAIRS-NEXT: aesmc {{v[0-7].16b}}, [[VH]]
		; CHECKFUSEALLPAIRS-NOT: aesmc

; CHECKM1: aese [[VA:v[0-7].16b]], {{v[0-7].16b}}		; CHECKM1: aese [[VA:v[0-7].16b]], {{v[0-7].16b}}
; CHECKM1-NEXT: aesmc {{v[0-7].16b}}, [[VA]]		; CHECKM1-NEXT: aesmc {{v[0-7].16b}}, [[VA]]
; CHECKM1: aese [[VH:v[0-7].16b]], {{v[0-7].16b}}		; CHECKM1: aese [[VH:v[0-7].16b]], {{v[0-7].16b}}
; CHECKM1: aese [[VB:v[0-7].16b]], {{v[0-7].16b}}		; CHECKM1: aese [[VB:v[0-7].16b]], {{v[0-7].16b}}
; CHECKM1-NEXT: aesmc {{v[0-7].16b}}, [[VB]]		; CHECKM1-NEXT: aesmc {{v[0-7].16b}}, [[VB]]
; CHECKM1: aese {{v[0-7].16b}}, {{v[0-7].16b}}		; CHECKM1: aese {{v[0-7].16b}}, {{v[0-7].16b}}
; CHECKM1: aese [[VC:v[0-7].16b]], {{v[0-7].16b}}		; CHECKM1: aese [[VC:v[0-7].16b]], {{v[0-7].16b}}
(69 lines not shown)	define void @aesda(<16 x i8>* %a0, <16 x i8>* %b0, <16 x i8>* %c0, <16 x i8> %d, <16 x i8> %e) {
store <16 x i8> %h1, <16 x i8>* %c1		store <16 x i8> %h1, <16 x i8>* %c1
%c2 = getelementptr inbounds <16 x i8>, <16 x i8>* %c0, i64 2		%c2 = getelementptr inbounds <16 x i8>, <16 x i8>* %c0, i64 2
store <16 x i8> %h2, <16 x i8>* %c2		store <16 x i8> %h2, <16 x i8>* %c2
%c3 = getelementptr inbounds <16 x i8>, <16 x i8>* %c0, i64 3		%c3 = getelementptr inbounds <16 x i8>, <16 x i8>* %c0, i64 3
store <16 x i8> %h3, <16 x i8>* %c3		store <16 x i8> %h3, <16 x i8>* %c3
ret void		ret void

; CHECK-LABEL: aesda:		; CHECK-LABEL: aesda:
; CHECKCORTEX: aesd [[VA:v[0-7].16b]], {{v[0-7].16b}}		; CHECKFUSEALLPAIRS: aesd [[VA:v[0-7].16b]], {{v[0-7].16b}}
; CHECKCORTEX-NEXT: aesimc {{v[0-7].16b}}, [[VA]]		; CHECKFUSEALLPAIRS-NEXT: aesimc {{v[0-7].16b}}, [[VA]]
; CHECKCORTEX: aesd [[VB:v[0-7].16b]], {{v[0-7].16b}}		; CHECKFUSEALLPAIRS: aesd [[VB:v[0-7].16b]], {{v[0-7].16b}}
; CHECKCORTEX-NEXT: aesimc {{v[0-7].16b}}, [[VB]]		; CHECKFUSEALLPAIRS-NEXT: aesimc {{v[0-7].16b}}, [[VB]]
; CHECKCORTEX: aesd [[VC:v[0-7].16b]], {{v[0-7].16b}}		; CHECKFUSEALLPAIRS: aesd [[VC:v[0-7].16b]], {{v[0-7].16b}}
; CHECKCORTEX-NEXT: aesimc {{v[0-7].16b}}, [[VC]]		; CHECKFUSEALLPAIRS-NEXT: aesimc {{v[0-7].16b}}, [[VC]]
; CHECKCORTEX: aesd [[VD:v[0-7].16b]], {{v[0-7].16b}}		; CHECKFUSEALLPAIRS: aesd [[VD:v[0-7].16b]], {{v[0-7].16b}}
; CHECKCORTEX-NEXT: aesimc {{v[0-7].16b}}, [[VD]]		; CHECKFUSEALLPAIRS-NEXT: aesimc {{v[0-7].16b}}, [[VD]]
; CHECKCORTEX: aesd [[VE:v[0-7].16b]], {{v[0-7].16b}}		; CHECKFUSEALLPAIRS: aesd [[VE:v[0-7].16b]], {{v[0-7].16b}}
; CHECKCORTEX-NEXT: aesimc {{v[0-7].16b}}, [[VE]]		; CHECKFUSEALLPAIRS-NEXT: aesimc {{v[0-7].16b}}, [[VE]]
; CHECKCORTEX: aesd [[VF:v[0-7].16b]], {{v[0-7].16b}}		; CHECKFUSEALLPAIRS: aesd [[VF:v[0-7].16b]], {{v[0-7].16b}}
; CHECKCORTEX-NEXT: aesimc {{v[0-7].16b}}, [[VF]]		; CHECKFUSEALLPAIRS-NEXT: aesimc {{v[0-7].16b}}, [[VF]]
; CHECKCORTEX: aesd [[VG:v[0-7].16b]], {{v[0-7].16b}}		; CHECKFUSEALLPAIRS: aesd [[VG:v[0-7].16b]], {{v[0-7].16b}}
; CHECKCORTEX-NEXT: aesimc {{v[0-7].16b}}, [[VG]]		; CHECKFUSEALLPAIRS-NEXT: aesimc {{v[0-7].16b}}, [[VG]]
; CHECKCORTEX: aesd [[VH:v[0-7].16b]], {{v[0-7].16b}}		; CHECKFUSEALLPAIRS: aesd [[VH:v[0-7].16b]], {{v[0-7].16b}}
; CHECKCORTEX-NEXT: aesimc {{v[0-7].16b}}, [[VH]]		; CHECKFUSEALLPAIRS-NEXT: aesimc {{v[0-7].16b}}, [[VH]]
		; CHECKFUSEALLPAIRS-NOT: aesimc

; CHECKM1: aesd [[VA:v[0-7].16b]], {{v[0-7].16b}}		; CHECKM1: aesd [[VA:v[0-7].16b]], {{v[0-7].16b}}
; CHECKM1-NEXT: aesimc {{v[0-7].16b}}, [[VA]]		; CHECKM1-NEXT: aesimc {{v[0-7].16b}}, [[VA]]
; CHECKM1: aesd [[VH:v[0-7].16b]], {{v[0-7].16b}}		; CHECKM1: aesd [[VH:v[0-7].16b]], {{v[0-7].16b}}
; CHECKM1: aesd [[VB:v[0-7].16b]], {{v[0-7].16b}}		; CHECKM1: aesd [[VB:v[0-7].16b]], {{v[0-7].16b}}
; CHECKM1-NEXT: aesimc {{v[0-7].16b}}, [[VB]]		; CHECKM1-NEXT: aesimc {{v[0-7].16b}}, [[VB]]
; CHECKM1: aesd {{v[0-7].16b}}, {{v[0-7].16b}}		; CHECKM1: aesd {{v[0-7].16b}}, {{v[0-7].16b}}
; CHECKM1: aesd [[VC:v[0-7].16b]], {{v[0-7].16b}}		; CHECKM1: aesd [[VC:v[0-7].16b]], {{v[0-7].16b}}
(29 lines not shown)	entry:
store <16 x i8> %aesmc2, <16 x i8>* %x5, align 16		store <16 x i8> %aesmc2, <16 x i8>* %x5, align 16
ret void		ret void

; CHECK-LABEL: aes_load_store:		; CHECK-LABEL: aes_load_store:
; CHECK: aese [[VA:v[0-7].16b]], {{v[0-7].16b}}		; CHECK: aese [[VA:v[0-7].16b]], {{v[0-7].16b}}
; CHECK-NEXT: aesmc {{v[0-7].16b}}, [[VA]]		; CHECK-NEXT: aesmc {{v[0-7].16b}}, [[VA]]
; CHECK: aese [[VB:v[0-7].16b]], {{v[0-7].16b}}		; CHECK: aese [[VB:v[0-7].16b]], {{v[0-7].16b}}
; CHECK-NEXT: aesmc {{v[0-7].16b}}, [[VB]]		; CHECK-NEXT: aesmc {{v[0-7].16b}}, [[VB]]
		; CHECK-NOT: aesmc
}		}
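
As a rough illustration of what the CHECKFUSEALLPAIRS patterns above assert, the following C++ sketch scans a list of assembly lines and checks that every aesmc or aesimc is immediately preceded by the aese or aesd that wrote the register it reads. This only approximates the combination of CHECK, CHECK-NEXT, and CHECK-NOT lines and is not how FileCheck works internally. `Inst`, `parseInst`, and `allAesPairsAdjacent` are hypothetical helper names, and the parser assumes simple lines of the form `mnemonic op, op`.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// One parsed assembly line: the mnemonic and its operands in order.
struct Inst {
  std::string mnemonic;
  std::vector<std::string> operands;
};

// Splits a line such as "aese v0.16b, v1.16b" on whitespace and commas.
// Assumption: no comments or labels share the line with the instruction.
Inst parseInst(const std::string &line) {
  std::string cleaned = line;
  for (char &c : cleaned)
    if (c == ',')
      c = ' ';
  std::istringstream in(cleaned);
  Inst inst;
  in >> inst.mnemonic;
  for (std::string op; in >> op;)
    inst.operands.push_back(op);
  return inst;
}

// True if every aesmc/aesimc directly follows an aese/aesd whose destination
// register is the register the aesmc/aesimc reads (its second operand).
bool allAesPairsAdjacent(const std::vector<std::string> &asmLines) {
  for (std::size_t i = 0; i < asmLines.size(); ++i) {
    const Inst second = parseInst(asmLines[i]);
    std::string partner;
    if (second.mnemonic == "aesmc")
      partner = "aese";
    else if (second.mnemonic == "aesimc")
      partner = "aesd";
    else
      continue; // not the second half of an AES pair
    if (i == 0 || second.operands.size() < 2)
      return false;
    const Inst first = parseInst(asmLines[i - 1]);
    if (first.mnemonic != partner || first.operands.empty() ||
        first.operands[0] != second.operands[1])
      return false;
  }
  return true;
}

// Small self-check: one fused sequence and one where a pair is split.
void demoAllAesPairsAdjacent() {
  assert(allAesPairsAdjacent({"aese v0.16b, v1.16b", "aesmc v0.16b, v0.16b"}));
  assert(!allAesPairsAdjacent(
      {"aese v0.16b, v1.16b", "aese v2.16b, v1.16b", "aesmc v0.16b, v0.16b"}));
}
```

The second case in the self-check has the same shape as the interleaving the CHECKM1 patterns allow, where another aese can sit between an aese and the aesmc that consumes its result.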