This is an archive of the discontinued LLVM Phabricator instance.

Differential D106646

[LoopVectorize] Don't interleave scalar ordered reductions for inner loops
ClosedPublic

Authored by david-arm on Jul 23 2021, 3:49 AM.

Details

Reviewers

sdesmalen
kmclaughlin
dmgreen
c-rhodes
peterwaller-arm

Commits

rGa5dd6c6cf935: [LoopVectorize] Don't interleave scalar ordered reductions for inner loops

Summary

Consider the following loop:

void foo(float *dst, float *src, int N) {
  for (int i = 0; i < N; i++) {
    dst[i] = 0.0;
    for (int j = 0; j < N; j++) {
      dst[i] += src[(i * N) + j];
    }
  }
}

When we are not building with -Ofast we may attempt to vectorise the
inner loop using ordered (strict, in-order) reductions instead. In addition
we also try to select an appropriate interleave count for the inner loop.
However, when VF=1 is chosen the inner loop remains scalar, and existing
code in selectInterleaveCount limits the interleave count to 2
for reductions due to concerns about increasing the critical path.
For ordered reductions this problem is even worse due to the additional
data dependency, so I've added code to simply disable interleaving
for scalar ordered reductions for now.

Test added here:

Transforms/LoopVectorize/AArch64/strict-fadd-vf1.ll
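
To make the critical-path concern in the summary concrete, here is a minimal hand-written sketch in plain C++. It is not output from the vectoriser, and the function names are hypothetical. It assumes IEEE single-precision arithmetic without fast-math. The "interleaved by 2" ordered version has to keep every add on the same accumulator, so each iteration carries two dependent additions. The two-accumulator version shortens the chain but is only legal when reassociation is allowed, because it can change the result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Strict (in-order) sum: one accumulator, additions in source order.
float sumOrdered(const std::vector<float> &v) {
  float acc = 0.0f;
  for (float x : v)
    acc = acc + x;
  return acc;
}

// Illustration of interleaving by 2 while preserving the order: two elements
// are handled per iteration, but both adds feed the same accumulator, so the
// loop-carried dependency chain is two fadds long instead of one.
float sumOrderedInterleaved2(const std::vector<float> &v) {
  float acc = 0.0f;
  std::size_t i = 0;
  for (; i + 1 < v.size(); i += 2) {
    acc = acc + v[i];     // must complete before the next add can start
    acc = acc + v[i + 1];
  }
  for (; i < v.size(); ++i) // remainder element, if any
    acc = acc + v[i];
  return acc;
}

// Reassociated sum with two independent accumulators. The adds in one
// iteration do not depend on each other, but the summation order differs
// from the source, so this is not a valid ordered reduction.
float sumTwoAccumulators(const std::vector<float> &v) {
  float acc0 = 0.0f, acc1 = 0.0f;
  std::size_t i = 0;
  for (; i + 1 < v.size(); i += 2) {
    acc0 = acc0 + v[i];
    acc1 = acc1 + v[i + 1];
  }
  for (; i < v.size(); ++i) // remainder element, if any
    acc0 = acc0 + v[i];
  return acc0 + acc1;
}
```

For an input such as {1.0e8f, 1.0f, -1.0e8f, 1.0f}, the two ordered versions agree with each other while the two-accumulator version gives a different value. That is why the ordered form cannot break the dependency chain by splitting the accumulator.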

Event Timeline

david-arm created this revision. Jul 23 2021, 3:49 AM

Herald added subscribers: hiraditya, kristof.beyls. Jul 23 2021, 3:49 AM

david-arm requested review of this revision. Jul 23 2021, 3:49 AM

Herald added a project: Restricted Project. Jul 23 2021, 3:49 AM

Herald added a subscriber: llvm-commits.

david-arm added a parent revision: D105432: [Analysis] Add simple cost model for strict (in-order) reductions. Jul 23 2021, 3:49 AM

Harbormaster completed remote builds in B115817: Diff 361151. Jul 23 2021, 4:25 AM

david-arm added a child revision: D106653: [LoopVectorize][AArch64] Enable ordered reductions by default for AArch64. Jul 23 2021, 5:16 AM

sdesmalen added inline comments. Jul 26 2021, 7:00 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6550	I don't think I fully understand why disabling interleaving is more profitable than having it enabled when VF=1, but I think you empirically found that having a UF=1 when VF>1 leads to regressions when enabling strict reductions. This means that with this patch enabling strict reductions by default will no longer lead to regressions, whereas without strict reductions enabled, this loop would not have been vectorized or interleaved in the first place. So this is purely limiting the scope of strict-reductions to avoid regressions. That approach sounds sensible to me.
llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd-vf1.ll
7	This `REQUIRES: asserts` ?
16	is this needed for the test?

david-arm added inline comments. Jul 26 2021, 7:07 AM

llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd-vf1.ll
7	Thanks, good spot @sdesmalen!
16	Probably not. It arises from the IR generated for the following C code where `dst[i]` is being initialised with memset:

void foo(float *dst, float *src, int N) {
  for (int i = 0; i < N; i++) {
    dst[i] = 0.0;
    for (int j = 0; j < N; j++) {
      dst[i] += src[(i * N) + j];
    }
  }
}

Without the memset we're just adding to the initial value for `dst[i]`.

Addressed review comments

david-arm marked 3 inline comments as done. Jul 27 2021, 6:50 AM

LGTM, cheers @david-arm.

This revision is now accepted and ready to land. Jul 27 2021, 6:55 AM

Harbormaster completed remote builds in B116418: Diff 362013. Jul 27 2021, 7:35 AM

Closed by commit rGa5dd6c6cf935: [LoopVectorize] Don't interleave scalar ordered reductions for inner loops (authored by david-arm). Jul 27 2021, 9:50 AM

This revision was automatically updated to reflect the committed changes.

david-arm added a commit: rGa5dd6c6cf935: [LoopVectorize] Don't interleave scalar ordered reductions for inner loops.

Revision Contents

Path                                                             Size
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp                  16 lines
llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd-vf1.ll    42 lines

Diff 362013

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

Context: if (!InterleavingRequiresRuntimePointerCheck && LoopCost < SmallLoopCost) {

     // saturated.
     unsigned NumStores = Legal->getNumStores();
     unsigned NumLoads = Legal->getNumLoads();
     unsigned StoresIC = IC / (NumStores ? NumStores : 1);
     unsigned LoadsIC = IC / (NumLoads ? NumLoads : 1);
 
     // If we have a scalar reduction (vector reductions are already dealt with
     // by this point), we can increase the critical path length if the loop
-    // we're interleaving is inside another loop. Limit, by default to 2, so the
-    // critical path only gets increased by one reduction operation.
+    // we're interleaving is inside another loop. For tree-wise reductions
+    // set the limit to 2, and for ordered reductions it's best to disable
+    // interleaving entirely.
     if (HasReductions && TheLoop->getLoopDepth() > 1) {
+      bool HasOrderedReductions =
+          any_of(Legal->getReductionVars(), [&](auto &Reduction) -> bool {
+            const RecurrenceDescriptor &RdxDesc = Reduction.second;
+            return RdxDesc.isOrdered();
+          });
+      if (HasOrderedReductions) {
+        LLVM_DEBUG(
+            dbgs() << "LV: Not interleaving scalar ordered reductions.\n");
+        return 1;
+      }
+
       unsigned F = static_cast<unsigned>(MaxNestedScalarReductionIC);
       SmallIC = std::min(SmallIC, F);
       StoresIC = std::min(StoresIC, F);
       LoadsIC = std::min(LoadsIC, F);
     }
 
     if (EnableLoadStoreRuntimeInterleave &&
         std::max(StoresIC, LoadsIC) > SmallIC) {
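
The decision made by the hunk above can be modelled outside LLVM as a small standalone function. This is only a sketch under stated assumptions: `ReductionInfo` and `limitNestedScalarReductionIC` are hypothetical names, not LLVM APIs. The default cap of 2 is taken from the old comment's "by default to 2". The real code clamps SmallIC, StoresIC and LoadsIC separately inside selectInterleaveCount rather than a single count.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for the one property of a reduction the patch checks.
struct ReductionInfo {
  bool Ordered; // true for a strict (in-order) floating-point reduction
};

// Hypothetical model of the nested-loop branch for a scalar (VF=1) loop:
//  - no reductions, or not inside another loop: leave the count unchanged;
//  - any ordered reduction: do not interleave at all (count 1);
//  - otherwise: cap the count at MaxNestedScalarReductionIC.
unsigned limitNestedScalarReductionIC(
    unsigned IC, unsigned LoopDepth,
    const std::vector<ReductionInfo> &Reductions,
    unsigned MaxNestedScalarReductionIC = 2) {
  if (Reductions.empty() || LoopDepth <= 1)
    return IC;

  bool HasOrderedReductions =
      std::any_of(Reductions.begin(), Reductions.end(),
                  [](const ReductionInfo &R) { return R.Ordered; });
  if (HasOrderedReductions)
    return 1;

  return std::min(IC, MaxNestedScalarReductionIC);
}
```

With a candidate count of 4 in an inner loop (depth 2), this model returns 1 when any reduction is ordered and 2 when none are. In an outermost loop (depth 1) the count is returned unchanged.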

llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd-vf1.ll

This file was added.

; REQUIRES: asserts
; RUN: opt -loop-vectorize -enable-strict-reductions=true -force-vector-width=1 -S < %s -debug 2>log | FileCheck %s
; RUN: cat log | FileCheck %s --check-prefix=CHECK-DEBUG

target triple = "aarch64-unknown-linux-gnu"

; CHECK-DEBUG: LV: Not interleaving scalar ordered reductions.

define void @foo(float* noalias nocapture %dst, float* noalias nocapture readonly %src, i64 %M, i64 %N) {
; CHECK-LABEL: @foo(
; CHECK-NOT: vector.body

entry:
  br label %for.body.us

for.body.us: ; preds = %entry, %for.cond3
  %i.023.us = phi i64 [ %inc8.us, %for.cond3 ], [ 0, %entry ]
  %arrayidx.us = getelementptr inbounds float, float* %dst, i64 %i.023.us
  %mul.us = mul nsw i64 %i.023.us, %N
  br label %for.body3.us

for.body3.us: ; preds = %for.body.us, %for.body3.us
  %0 = phi float [ 0.000000e+00, %for.body.us ], [ %add6.us, %for.body3.us ]
  %j.021.us = phi i64 [ 0, %for.body.us ], [ %inc.us, %for.body3.us ]
  %add.us = add nsw i64 %j.021.us, %mul.us
  %arrayidx4.us = getelementptr inbounds float, float* %src, i64 %add.us
  %1 = load float, float* %arrayidx4.us, align 4
  %add6.us = fadd float %1, %0
  %inc.us = add nuw nsw i64 %j.021.us, 1
  %exitcond.not = icmp eq i64 %inc.us, %N
  br i1 %exitcond.not, label %for.cond3, label %for.body3.us

for.cond3: ; preds = %for.body3.us
  %add6.us.lcssa = phi float [ %add6.us, %for.body3.us ]
  store float %add6.us.lcssa, float* %arrayidx.us, align 4
  %inc8.us = add nuw nsw i64 %i.023.us, 1
  %exitcond26.not = icmp eq i64 %inc8.us, %M
  br i1 %exitcond26.not, label %exit, label %for.body.us

exit: ; preds = %for.cond3
  ret void
}