If a vector body has live-out values, it is most likely a reduction, which needs a final reduction step after the loop. MVE has the VADDV instruction to reduce integer vectors, but it has no equivalent for float vectors. A live-out value that is not recognised as a reduction later in the optimisation pipeline causes the tail-predicated loop to be reverted to a non-predicated loop, which is very expensive, i.e. it has a significant performance impact. That is what we hope to avoid by fine-tuning the ARM TTI hook preferPredicateOverEpilogue implementation.
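To make this concrete, here is a hypothetical pair of loops (not taken from the patch or its tests; the function names are made up) illustrating the difference: the integer add reduction has a live-out that MVE can finalise with VADDV after a tail-predicated loop, while the float reduction has no such instruction.

```c
/* Hypothetical example, not from the patch: an integer add reduction.
   The running sum 's' is the live-out; after a tail-predicated vector
   loop it can be reduced to a scalar with a single VADDV. */
int sum_i32(const int *a, int n) {
  int s = 0;
  for (int i = 0; i < n; ++i)
    s += a[i];
  return s;
}

/* The same pattern with floats: MVE has no VADDV equivalent for float
   vectors, so if this live-out is not recognised as a reduction later
   in the pipeline, the tail-predicated loop has to be reverted. */
float sum_f32(const float *a, int n) {
  float s = 0.0f;
  for (int i = 0; i < n; ++i)
    s += a[i];
  return s;
}
```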
Event Timeline
Hello. I agree with this, but I would suggest we take it further.
There are multiple forms of reductions that the vectorizer can produce. It's at least:
int add, int mul
float add, float mul
and, or, xor
int smin, smax, umin, umax
float min & max.
I think ARMLowOverheadLoops will currently only handle integer adds. Considering how powerful the vectorizer's handling of reductions is, and how easy (and expensive) it would be for the ARMLowOverheadLoops pass to get that wrong, I would suggest just disabling _all_ reductions at the moment.
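For illustration, here are two hypothetical loops (not from the patch; names are made up) of the kinds listed above that the vectorizer can turn into reductions but that ARMLowOverheadLoops does not handle yet:

```c
/* Hypothetical examples, not from the patch: reductions other than
   integer add that the vectorizer can recognise. */
int prod_i32(const int *a, int n) {
  int p = 1;
  for (int i = 0; i < n; ++i)
    p *= a[i];            /* integer mul reduction */
  return p;
}

unsigned xor_u32(const unsigned *a, int n) {
  unsigned x = 0;
  for (int i = 0; i < n; ++i)
    x ^= a[i];            /* xor reduction */
  return x;
}
```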
Thanks Dave, those are good points.
ARMLowOverheadLoops indeed currently handles only integer adds, so I will have a go at further restricting the int operations.
Now only integer add reductions are allowed; all other operations (including floats) are rejected.
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp:1427
I took a look at the reduction code in ARMLowOverheadLoops. The vectorizer can create add reductions in a number of places and in a number of ways (it can have multiple reductions, multiple reduction steps, look through phi's and selects and even change the type to a smaller bitwidth). I don't think the backend pass can handle all of them yet. Considering how expensive reverting can be in comparison, what do you think of just disabling all reductions for the time being and re-enabling them once we know that things are working well enough? Maybe add a FIXME for it? I would expect D75069 to handle integer reduction in most cases eventually from the vectorizer directly and we can probably do something similar for other reduction types.
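As one illustration of the "smaller bitwidth" case mentioned above, a hypothetical loop (not from the patch; the name is made up) summing bytes is a pattern where the vectorizer may shrink the reduction type:

```c
/* Hypothetical example, not from the patch: an add reduction over bytes.
   The vectorizer may narrow the reduction to a smaller bitwidth than the
   declared accumulator type, which is one of the patterns the backend
   pass would have to recognise. */
int sum_u8(const unsigned char *a, int n) {
  int s = 0;
  for (int i = 0; i < n; ++i)
    s += a[i];
  return s;
}
```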
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp:1427
In my testing, I couldn't get the vectorizer to produce a tail-folded loop with multiple reductions, so I don't think we have to worry about that at the moment. My basic example also didn't even vectorize an FP reduction loop. I think disabling all reductions would be too conservative now; we should have done it from the beginning, but now that we actually can handle something, I think we should try to have this cost function align with the LowOverheadLoops implementation.
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp:1427
If we really don't want tail-folding, I think that should be controlled in another way than by restricting this hook to do nothing. We have the option -prefer-predicate-over-epilog, and that should do that for now. If source-code modifications and more fine-grained control are possible, we even have a pragma. Like Sam said, I would like to start with integer add reductions, so that we are also testing that part in the back-end.
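For reference, this is roughly how that control looks from the user side. The thread does not name the pragma; I am assuming it refers to Clang's vectorize_predicate loop hint, and the function name below is made up:

```c
/* Per-loop control with Clang's loop hint (assuming the pragma referred
   to above is vectorize_predicate): */
void scale(int *a, int k, int n) {
#pragma clang loop vectorize_predicate(enable)
  for (int i = 0; i < n; ++i)
    a[i] *= k;
}

/* Globally, the LLVM option mentioned above can be passed through Clang,
   e.g. -mllvm -prefer-predicate-over-epilog (exact spelling and default
   may differ between LLVM versions). */
```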
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp:1427
I accidentally hit Submit too soon, but wanted to add a bit more nuance to my previous comment: "this hook to do nothing" -> "this hook to do almost nothing". I also wanted to add that in general matching up the behaviour here with the backend is probably best, and that I could look into disallowing multiple reductions, but it looks like we don't really have to worry about that. And if there are some inefficiencies around integer add reductions, I guess that's fine at this stage (tail-folding can be disabled, see previous comment), and I think it's good to have these inefficiencies explicit and visible.
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp:1427
If we go for matching the limitations of LowOverheadLoops (of which there are many!), I think we want to check that:
Half related to reductions, I'd also prevent tail-folding if there's a select in the loop - I'm pretty sure min/max kernels will fail to be tail predicated, and we can't handle any VPSEL in the loop either.
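To illustrate the select concern, here is a hypothetical max-reduction kernel (not from the patch; the name is made up) where the comparison becomes a select in the vectorized body:

```c
/* Hypothetical example, not from the patch: a max reduction (assumes
   n >= 1). The 'if' becomes a select in the vectorized loop body, which
   maps to VPSEL and is one reason min/max kernels resist tail-predication. */
int max_i32(const int *a, int n) {
  int m = a[0];
  for (int i = 1; i < n; ++i)
    if (a[i] > m)
      m = a[i];
  return m;
}
```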
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp:1427
Hmm. OK. I sent you some examples of the tests I had that were going wrong. Hopefully they will help. I would go with conservative for the time being to make turning on TailPred by default easier. We can always change it later, plus D75069 will change the way integer reductions are generated anyway! I would think the backend pass is more useful for other reductions in the long run - but even then, like we talked about, it might be better to do things earlier. But if you guys think that you can get it to work - then sounds good. I'll leave it to you. I personally wouldn't spend a lot of time on it though, if the goal is in the end to do it another way.
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp:1427
That would make most sense to me.
Check. (Well, almost; need to restrict it to 1 live-out.)
Will add this.
Yep, will add this. This should go in canTailPredicateInstruction(); perhaps I will do this in a little separate patch.
After our last discussion on this, we agreed that tail-folding should be disabled by default (which it is), that the only reductions we want at this point are integer add reductions (added in this patch), and that we can control enabling/disabling reductions with an option (added in D83133). I thought this was what we all believed in. :)