This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SVE] Enable Tail-Folding. WIP
AbandonedPublic

Authored by SjoerdMeijer on Nov 21 2022, 5:39 AM.


Details

Reviewers

paulwalker-arm
david-arm
dmgreen
sdesmalen
fhahn
efriedma

Summary

This patch enables tail-folding for SVE. As you know, tail-folding has great potential to improve codegen because we no longer have to emit a vector loop plus an epilogue loop, the runtime checks that go with them, and some of the setup code for the vector loop. This can help performance significantly in some cases.
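To make the difference concrete, here is a minimal scalar sketch of the two loop shapes for a memset-style loop. It is not what the vectoriser emits: the fixed width `VF`, the function names, and the per-lane `bool` mask are assumptions chosen for illustration, and real SVE code would use scalable vectors and predicate registers.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical fixed vector width; SVE vectors are scalable in reality.
constexpr std::size_t VF = 4;

// Without tail-folding: a "vector" loop over whole chunks of VF elements,
// followed by a scalar epilogue loop for the remaining n % VF elements.
void memsetVectorPlusEpilogue(std::vector<int> &dst, int val) {
  const std::size_t n = dst.size();
  std::size_t i = 0;
  for (; i + VF <= n; i += VF)
    for (std::size_t lane = 0; lane < VF; ++lane)
      dst[i + lane] = val;
  for (; i < n; ++i) // scalar epilogue
    dst[i] = val;
}

// With tail-folding: a single loop. Each iteration computes a lane mask
// (lane is active iff its index is below n) and performs a masked store,
// so the final partial chunk needs no separate epilogue loop.
void memsetTailFolded(std::vector<int> &dst, int val) {
  const std::size_t n = dst.size();
  for (std::size_t i = 0; i < n; i += VF) {
    std::array<bool, VF> mask{};
    for (std::size_t lane = 0; lane < VF; ++lane)
      mask[lane] = (i + lane) < n;
    for (std::size_t lane = 0; lane < VF; ++lane)
      if (mask[lane])
        dst[i + lane] = val;
  }
}
```

Both functions write the same values; the tail-folded form trades the epilogue and its checks for the cost of computing and applying a mask on every iteration.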

I have added WIP (work-in-progress) to the subject as I am looking into collecting some more performance numbers and wanted to get your input while I am doing that.

My results so far on a 2x256b SVE implementation:

5% uplift for X264 (SPEC INT 2017)
Neutral for the other apps in SPECINT2017.
1% uplift on an embedded benchmark. It's not a very representative workload, but it has a few matrix kernels and this 1% is significant for that benchmark, nicely illustrating benefits of tail-folding.
I've tried the llvm test-suite, but just trying to generate a baseline shows it's really noisy. I haven't yet tried with tail-folding, because I am not sure I can conclude anything from the numbers.

Next, I will collect numbers for SPEC FP 2017.

This change enables only the "simple" tail-folding, so it does not deal with e.g. reductions or recurrences. This seemed like a good first step to me while we get more experience with this. I am interested to hear if you have suggestions for workloads or cases that I should check.

Diff Detail

Event Timeline

SjoerdMeijer created this revision. Nov 21 2022, 5:39 AM

Herald added a reviewer: efriedma. Nov 21 2022, 5:39 AM

Herald added a project: Restricted Project.

Herald added subscribers: ctetreau, psnobl, hiraditya and 2 others.

SjoerdMeijer requested review of this revision. Nov 21 2022, 5:39 AM

Herald added a project: Restricted Project. Nov 21 2022, 5:39 AM

Herald added a subscriber: pcwang-thead.

Hi @SjoerdMeijer, thanks for looking into this. I do actually already have a patch to enable this by default (https://reviews.llvm.org/D130618), where the default behaviour is tuned according to the CPU. I think this is what we want because the profile will change according to what CPU you're running on: some CPUs may handle reductions better than others. The decision in this patch may be incorrect for 128-bit vector implementations. I also ran SPEC2k17 on an SVE-enabled CPU and I remember seeing a small (2-3%) regression in parest or something like that, which is one of the reasons I didn't push the patch any further. I also think it's really important to run a much larger set of benchmarks besides SPEC2k17 and collect numbers to show the benefits, since there isn't much vectorisation actually going on in SPEC2k17.

One of the major problems with the current tail-folding implementation is that we make the decision before doing any cost analysis in the vectoriser, which isn't great because we may be forcing the vectoriser down different code paths than it would take if we didn't tail-fold. Ideally, what we really want is to move to a model where the vectoriser has a two-dimensional matrix of costs covering each combination of VF and vectorisation style (e.g. tail-folding vs whole vector loops, etc.), and chooses the cheapest combination.
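As a rough illustration of that model, the sketch below picks the cheapest entry from a table of costs indexed by candidate VF and vectorisation style. Everything here is hypothetical: the `Style` enum, the `Choice` struct, `pickCheapest`, and the idea that costs are plain `double` values are assumptions for the example and do not correspond to any existing LoopVectorize interface.

```cpp
#include <array>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical vectorisation styles; a real model could have more.
enum class Style { WholeVectorLoop, TailFolded };

constexpr std::array<Style, 2> kStyles{Style::WholeVectorLoop,
                                       Style::TailFolded};

struct Choice {
  std::size_t vfIndex; // index into the list of candidate VFs
  Style style;
  double cost;
};

// costs[v][s] is the estimated cost of candidate VF number v when
// vectorised with style kStyles[s]. The function scans the whole matrix
// and returns the cheapest (VF, style) pair, so the tail-folding decision
// is made together with the VF decision rather than before it.
Choice pickCheapest(const std::vector<std::array<double, 2>> &costs) {
  Choice best{0, Style::WholeVectorLoop,
              std::numeric_limits<double>::infinity()};
  for (std::size_t v = 0; v < costs.size(); ++v)
    for (std::size_t s = 0; s < kStyles.size(); ++s)
      if (costs[v][s] < best.cost)
        best = Choice{v, kStyles[s], costs[v][s]};
  return best;
}
```

The point of the sketch is only the shape of the decision: with both dimensions in one table, a style that is cheaper at one VF and more expensive at another can be chosen per combination.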

Ah, I didn't know about D130618!
I will reply on that ticket; that's probably best, to keep things in one place?

Harbormaster completed remote builds in B198764: Diff 476865. Nov 21 2022, 3:01 PM

Matt added a subscriber: Matt. Nov 21 2022, 3:09 PM

SjoerdMeijer abandoned this revision. Mar 17 2023, 1:39 AM

Herald added a subscriber: StephenFan. Mar 17 2023, 1:39 AM

Revision Contents

Path                                                                     Size
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp                   4 lines
llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll   38 lines

Diff 476865

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

...
   if (Val.empty())
     return;
   SmallVector<StringRef, 6> TailFoldTypes;
   StringRef(Val).split(TailFoldTypes, '+', -1, false);
   for (auto TailFoldType : TailFoldTypes) {
     if (TailFoldType == "disabled")
       Bits = 0;
     else if (TailFoldType == "all")
       Bits = TFAll;
-    else if (TailFoldType == "default")
-      Bits = 0; // Currently defaults to never tail-folding.
-    else if (TailFoldType == "simple")
+    else if (TailFoldType == "default" || TailFoldType == "simple")
       add(TFSimple);
     else if (TailFoldType == "reductions")
       add(TFReductions);
     else if (TailFoldType == "recurrences")
       add(TFRecurrences);
     else if (TailFoldType == "noreductions")
       remove(TFReductions);
     else if (TailFoldType == "norecurrences")
...
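For readers without the LLVM tree at hand, the behaviour of this hunk after the patch can be approximated with a standalone sketch. This is not the LLVM implementation: the bit values in `TailFoldingKind`, the function name `parseTailFolding`, the `norecurrences` body (elided in the diff above and assumed to mirror `noreductions`), and the handling of unknown tokens are all assumptions made for illustration.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Assumed bit layout; only the names appear in the diff above.
enum TailFoldingKind : std::uint8_t {
  TFDisabled = 0x00,
  TFReductions = 0x01,
  TFRecurrences = 0x02,
  TFSimple = 0x80,
  TFAll = TFReductions | TFRecurrences | TFSimple
};

// Applies each '+'-separated token of val from left to right and stores
// the resulting bit set in bits. Returns false on an unknown token
// (hypothetical error handling, not taken from the patch).
bool parseTailFolding(const std::string &val, std::uint8_t &bits) {
  bits = TFDisabled;
  std::istringstream stream(val);
  std::string token;
  while (std::getline(stream, token, '+')) {
    if (token.empty())
      continue; // skip empty pieces, like split(..., KeepEmpty=false)
    if (token == "disabled")
      bits = TFDisabled;
    else if (token == "all")
      bits = TFAll;
    else if (token == "default" || token == "simple")
      bits |= TFSimple; // after this patch, "default" behaves like "simple"
    else if (token == "reductions")
      bits |= TFReductions;
    else if (token == "recurrences")
      bits |= TFRecurrences;
    else if (token == "noreductions")
      bits &= static_cast<std::uint8_t>(~TFReductions);
    else if (token == "norecurrences") // assumed to mirror "noreductions"
      bits &= static_cast<std::uint8_t>(~TFRecurrences);
    else
      return false;
  }
  return true;
}
```

Because tokens are applied in order, `disabled+simple+reductions+recurrences` ends up with the same bits as `all`, which is consistent with both RUN lines in the test file using the `CHECK-TF` prefix.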

llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-option.ll

 ; RUN: opt < %s -loop-vectorize -sve-tail-folding=disabled -S | FileCheck %s -check-prefix=CHECK-NOTF
-; RUN: opt < %s -loop-vectorize -sve-tail-folding=default -S | FileCheck %s -check-prefix=CHECK-NOTF
+; RUN: opt < %s -loop-vectorize -sve-tail-folding=default -S | FileCheck %s -check-prefix=CHECK-SIMPLE
+; RUN: opt < %s -loop-vectorize -sve-tail-folding=simple -S | FileCheck %s -check-prefix=CHECK-SIMPLE
 ; RUN: opt < %s -loop-vectorize -sve-tail-folding=all -S | FileCheck %s -check-prefix=CHECK-TF
 ; RUN: opt < %s -loop-vectorize -sve-tail-folding=disabled+simple+reductions+recurrences -S | FileCheck %s -check-prefix=CHECK-TF
 ; RUN: opt < %s -loop-vectorize -sve-tail-folding=all+noreductions -S | FileCheck %s -check-prefix=CHECK-TF-NORED
 ; RUN: opt < %s -loop-vectorize -sve-tail-folding=all+norecurrences -S | FileCheck %s -check-prefix=CHECK-TF-NOREC
 ; RUN: opt < %s -loop-vectorize -sve-tail-folding=reductions -S | FileCheck %s -check-prefix=CHECK-TF-ONLYRED

 target triple = "aarch64-unknown-linux-gnu"

 define void @simple_memset(i32 %val, i32* %ptr, i64 %n) #0 {
 ; CHECK-NOTF-LABEL: @simple_memset(
 ; CHECK-NOTF: vector.ph:
 ; CHECK-NOTF: %[[INSERT:.*]] = insertelement <vscale x 4 x i32> poison, i32 %val, i32 0
 ; CHECK-NOTF: %[[SPLAT:.*]] = shufflevector <vscale x 4 x i32> %[[INSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
 ; CHECK-NOTF: vector.body:
 ; CHECK-NOTF-NOT: %{{.*}} = phi <vscale x 4 x i1>
 ; CHECK-NOTF: store <vscale x 4 x i32> %[[SPLAT]], <vscale x 4 x i32>*

+; CHECK-SIMPLE-LABEL: @simple_memset(
+; CHECK-SIMPLE: vector.ph:
+; CHECK-SIMPLE: %[[INSERT:.*]] = insertelement <vscale x 4 x i32> poison, i32 %val, i32 0
+; CHECK-SIMPLE: %[[SPLAT:.*]] = shufflevector <vscale x 4 x i32> %[[INSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
+; CHECK-SIMPLE: vector.body:
+; CHECK-SIMPLE: %[[ACTIVE_LANE_MASK:.*]] = phi <vscale x 4 x i1>
+; CHECK-SIMPLE: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> %[[SPLAT]], {{.*}} %[[ACTIVE_LANE_MASK]]
+
 ; CHECK-TF-NORED-LABEL: @simple_memset(
 ; CHECK-TF-NORED: vector.ph:
 ; CHECK-TF-NORED: %[[INSERT:.*]] = insertelement <vscale x 4 x i32> poison, i32 %val, i32 0
 ; CHECK-TF-NORED: %[[SPLAT:.*]] = shufflevector <vscale x 4 x i32> %[[INSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
 ; CHECK-TF-NORED: vector.body:
 ; CHECK-TF-NORED: %[[ACTIVE_LANE_MASK:.*]] = phi <vscale x 4 x i1>
 ; CHECK-TF-NORED: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> %[[SPLAT]], {{.*}} %[[ACTIVE_LANE_MASK]]

...
 ; CHECK-NOTF-LABEL: @fadd_red_fast
 ; CHECK-NOTF: vector.body:
 ; CHECK-NOTF-NOT: %{{.*}} = phi <vscale x 4 x i1>
 ; CHECK-NOTF: %[[LOAD:.*]] = load <vscale x 4 x float>
 ; CHECK-NOTF: %[[ADD:.*]] = fadd fast <vscale x 4 x float> %[[LOAD]]
 ; CHECK-NOTF: middle.block:
 ; CHECK-NOTF-NEXT: call fast float @llvm.vector.reduce.fadd.nxv4f32(float -0.000000e+00, <vscale x 4 x float> %[[ADD]])

+; CHECK-SIMPLE-LABEL: @fadd_red_fast
+; CHECK-SIMPLE: vector.body:
+; CHECK-SIMPLE-NOT: %{{.*}} = phi <vscale x 4 x i1>
+; CHECK-SIMPLE: %[[LOAD:.*]] = load <vscale x 4 x float>
+; CHECK-SIMPLE: %[[ADD:.*]] = fadd fast <vscale x 4 x float> %[[LOAD]]
+; CHECK-SIMPLE: middle.block:
+; CHECK-SIMPLE-NEXT: call fast float @llvm.vector.reduce.fadd.nxv4f32(float -0.000000e+00, <vscale x 4 x float> %[[ADD]])
+
 ; CHECK-TF-NORED-LABEL: @fadd_red_fast
 ; CHECK-TF-NORED: vector.body:
 ; CHECK-TF-NORED-NOT: %{{.*}} = phi <vscale x 4 x i1>
 ; CHECK-TF-NORED: %[[LOAD:.*]] = load <vscale x 4 x float>
 ; CHECK-TF-NORED: %[[ADD:.*]] = fadd fast <vscale x 4 x float> %[[LOAD]]
 ; CHECK-TF-NORED: middle.block:
 ; CHECK-TF-NORED-NEXT: call fast float @llvm.vector.reduce.fadd.nxv4f32(float -0.000000e+00, <vscale x 4 x float> %[[ADD]])

...
 ; CHECK-NOTF: vector.body:
 ; CHECK-NOTF-NOT: %{{.*}} = phi <vscale x 4 x i1>
 ; CHECK-NOTF: %[[VECTOR_RECUR:.*]] = phi <vscale x 4 x i32> [ %[[RECUR_INIT]], %vector.ph ], [ %[[LOAD:.*]], %vector.body ]
 ; CHECK-NOTF: %[[LOAD]] = load <vscale x 4 x i32>
 ; CHECK-NOTF: %[[SPLICE:.*]] = call <vscale x 4 x i32> @llvm.experimental.vector.splice.nxv4i32(<vscale x 4 x i32> %[[VECTOR_RECUR]], <vscale x 4 x i32> %[[LOAD]], i32 -1)
 ; CHECK-NOTF: %[[ADD:.*]] = add nsw <vscale x 4 x i32> %[[LOAD]], %[[SPLICE]]
 ; CHECK-NOTF: store <vscale x 4 x i32> %[[ADD]]

+; CHECK-SIMPLE-LABEL: @add_recur
+; CHECK-SIMPLE: entry:
+; CHECK-SIMPLE: %[[PRE:.*]] = load i32, i32* %src, align 4
+; CHECK-SIMPLE: vector.ph:
+; CHECK-SIMPLE: %[[RECUR_INIT:.*]] = insertelement <vscale x 4 x i32> poison, i32 %[[PRE]]
+; CHECK-SIMPLE: vector.body:
+; CHECK-SIMPLE-NOT: %{{.*}} = phi <vscale x 4 x i1>
+; CHECK-SIMPLE: %[[VECTOR_RECUR:.*]] = phi <vscale x 4 x i32> [ %[[RECUR_INIT]], %vector.ph ], [ %[[LOAD:.*]], %vector.body ]
+; CHECK-SIMPLE: %[[LOAD]] = load <vscale x 4 x i32>
+; CHECK-SIMPLE: %[[SPLICE:.*]] = call <vscale x 4 x i32> @llvm.experimental.vector.splice.nxv4i32(<vscale x 4 x i32> %[[VECTOR_RECUR]], <vscale x 4 x i32> %[[LOAD]], i32 -1)
+; CHECK-SIMPLE: %[[ADD:.*]] = add nsw <vscale x 4 x i32> %[[LOAD]], %[[SPLICE]]
+; CHECK-SIMPLE: store <vscale x 4 x i32> %[[ADD]]
+
 ; CHECK-TF-NORED-LABEL: @add_recur
 ; CHECK-TF-NORED: entry:
 ; CHECK-TF-NORED: %[[PRE:.*]] = load i32, i32* %src, align 4
 ; CHECK-TF-NORED: vector.ph:
 ; CHECK-TF-NORED: %[[RECUR_INIT:.*]] = insertelement <vscale x 4 x i32> poison, i32 %[[PRE]]
 ; CHECK-TF-NORED: vector.body:
 ; CHECK-TF-NORED: %[[ACTIVE_LANE_MASK:.*]] = phi <vscale x 4 x i1>
 ; CHECK-TF-NORED: %[[VECTOR_RECUR:.*]] = phi <vscale x 4 x i32> [ %[[RECUR_INIT]], %vector.ph ], [ %[[LOAD:.*]], %vector.body ]
...

 define void @interleave(float* noalias %dst, float* noalias %src, i64 %n) #0 {
 ; CHECK-NOTF-LABEL: @interleave(
 ; CHECK-NOTF: vector.body:
 ; CHECK-NOTF: %[[LOAD:.*]] = load <8 x float>, <8 x float>
 ; CHECK-NOTF: %{{.*}} = shufflevector <8 x float> %[[LOAD]], <8 x float> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
 ; CHECK-NOTF: %{{.*}} = shufflevector <8 x float> %[[LOAD]], <8 x float> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>

+; CHECK-SIMPLE-LABEL: @interleave(
+; CHECK-SIMPLE: vector.body:
+; CHECK-SIMPLE: %[[LOAD:.*]] = load <8 x float>, <8 x float>
+; CHECK-SIMPLE: %{{.*}} = shufflevector <8 x float> %[[LOAD]], <8 x float> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
+; CHECK-SIMPLE: %{{.*}} = shufflevector <8 x float> %[[LOAD]], <8 x float> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
+
 ; CHECK-TF-LABEL: @interleave(
 ; CHECK-TF: vector.body:
 ; CHECK-TF: %[[LOAD:.*]] = load <8 x float>, <8 x float>
 ; CHECK-TF: %{{.*}} = shufflevector <8 x float> %[[LOAD]], <8 x float> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
 ; CHECK-TF: %{{.*}} = shufflevector <8 x float> %[[LOAD]], <8 x float> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>

 ; CHECK-TF-NORED-LABEL: @interleave(
 ; CHECK-TF-NORED: vector.body:
...