This is an archive of the discontinued LLVM Phabricator instance.

Differential D118566

[LoopVectorizer] Don't perform interleaving of predicated scalar loops
ClosedPublic

Authored by dmgreen on Jan 30 2022, 6:31 AM.

Details

Reviewers

fhahn
david-arm
sdesmalen
spatel
wsmoses

Commits

rGb4c6d1bb3791: [LoopVectorizer] Don't perform interleaving of predicated scalar loops

Summary

The vectorizer will at times choose to "vectorize" a loop with a scalar factor (VF=1) and interleaving (IC > 1). This can produce better code than the unroller (notably for reductions, where it can produce independent reduction chains that are combined after the loop). At times this is not very beneficial though, for example when runtime checks are needed or when the scalar code requires predication.

This addresses the second point, preventing the vectorizer from interleaving when the scalar loop will require predication. This prevents it from making a bit of a mess, that is worse than the original and better left for the unroller to unroll if beneficial. It helps reverse some of the regressions from D118090.
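To make the VF=1, IC=2 case concrete, here is a hand-written sketch of what interleaving a scalar reduction amounts to at the source level. It is not vectorizer output, and the function names `sumScalar` and `sumInterleaved2` are hypothetical. Integer arithmetic is used so both versions agree exactly; for floating point the reassociation would need fast-math.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Original scalar reduction: a single dependency chain through `sum`.
int64_t sumScalar(const std::vector<int32_t> &a) {
  int64_t sum = 0;
  for (std::size_t i = 0; i < a.size(); ++i)
    sum += a[i];
  return sum;
}

// Hand-written equivalent of VF=1, IC=2: two independent reduction chains
// that are combined after the main loop, followed by a scalar remainder.
int64_t sumInterleaved2(const std::vector<int32_t> &a) {
  const std::size_t n = a.size();
  const std::size_t nMain = n - (n % 2); // iterations covered by the main loop
  int64_t sum0 = 0, sum1 = 0;
  std::size_t i = 0;
  for (; i < nMain; i += 2) {
    sum0 += a[i];     // chain 0
    sum1 += a[i + 1]; // chain 1, independent of chain 0
  }
  assert(i == nMain);
  int64_t sum = sum0 + sum1; // combine the chains after the loop
  for (; i < n; ++i)         // remainder iteration, if any
    sum += a[i];
  return sum;
}
```

The two chains can be executed in parallel by an out-of-order core, which is the benefit the summary refers to. Nothing in this sketch is predicated, so it does not show the case the patch disables.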

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

dmgreen created this revision. Jan 30 2022, 6:31 AM

Herald added a subscriber: hiraditya. Jan 30 2022, 6:31 AM

dmgreen requested review of this revision. Jan 30 2022, 6:31 AM

Herald added a project: Restricted Project. Jan 30 2022, 6:31 AM

Harbormaster completed remote builds in B146532: Diff 404374. Jan 30 2022, 6:31 AM

This prevents it from making a bit of a mess, that is worse than the original and better left for the unroller to unroll if beneficial

Can you expand a little bit on why this becomes a bit of a mess? The original scalar loop has control-flow for predication as well, so I guess interleaving would just duplicate such control flow for the second scalar iteration. Is the code generated by the LV less efficient or are we missing any folds/simplification? Or is there a fundamental reason this can never be an improvement? I could imagine a scenario where most of the loop body would benefit from interleaving, but one statement in the loop doesn't because of predication, it would still be beneficial to interleave.

In D118566#3283274, @sdesmalen wrote:

This prevents it from making a bit of a mess, that is worse than the original and better left for the unroller to unroll if beneficial

Can you expand a little bit on why this becomes a bit of a mess? The original scalar loop has control-flow for predication as well, so I guess interleaving would just duplicate such control flow for the second scalar iteration. Is the code generated by the LV less efficient or are we missing any folds/simplification? Or is there a fundamental reason this can never be an improvement? I could imagine a scenario where most of the loop body would benefit from interleaving, but one statement in the loop doesn't because of predication, it would still be beneficial to interleave.

It is less efficient in all the benchmarks I've run. It won't come up very often - we usually either choose to vectorize or won't choose to interleave. Interleaving is generally only done for smallish loops. When the vectorizer is forced to make serialized predicate blocks (and possibly add scev checks, as in the testcase) - it's hard to see how the code could be so much better than it is now. The patch gives a 50-60% improvement in the places it helps.

Whatever happens, it is best to leave it for the unroller to unroll with its own profitability heuristics (which in this case, it likely will not).

Hi @dmgreen, from what you're saying it sounds like perhaps the problem is due to two related things:

1. The cost model for interleaving predicated operations is broken in some cases and needs fixing in the long term (since if the code it produces is of such a low quality then it has probably vastly underestimated the cost). I'm not sure if selectInterleaveCount really takes the cost into account for the VF=1 case - it seems to be a selection of bolted-on workarounds/guesses due to the lack of a proper cost model.
2. The loop vectoriser is basically rubbish at unrolling and/or scalarising predicated operations and ultimately in the long term we probably want to fix this. I imagine this is also a problem for VF>1 when the predicated operation has to be scalarised.

One thing to note about your patch is that you are calling blockNeedsPredicationForAnyReason, which includes predicated loops (tail-folding). That may be the right thing to do in this case, but I think it's worth adding a test case for it at least?

In D118566#3286623, @dmgreen wrote:

It is less efficient in all the benchmarks I've run. It won't come up very often - we usually either choose to vectorize or won't choose to interleave. Interleaving is generally only done for smallish loops. When the vectorizer is forced to make serialized predicate blocks (and possibly add scev checks, as in the testcase) - it's hard to see how the code could be so much better than it is now. The patch gives a 50-60% improvement in the places it helps.

Whatever happens, it is best to leave it for the unroller to unroll with its own profitability heuristics (which in this case, it likely will not).

Okay, I think I see what you mean now. Because the block is predicated, the LV will try to predicate each operation that needs predication (e.g. every load/store), so that we end up executing NumPredicatedOps * UF branch instructions.

I guess the SCEVChecks would be something we could add to the cost-model in a separate patch, e.g. if the LV has to generate SCEVChecks at all don't bother setting UF>1 iff VF=1.
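The branch count described above can be sketched by transcribing the shape of the interleaved loop body from the test in this revision into C++. This is a hand-written model, not compiler output. The names `dotPredicated` and `dotPredicatedInterleaved2`, the `float` element type (the test uses `half`), the simplified `b[i - j]` indexing, and the `predBranches` counter are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Result {
  float sum;
  uint64_t predBranches; // conditional branches that guard predicated loads
};

// Scalar loop: one branch per iteration guards both loads.
Result dotPredicated(const std::vector<float> &a, const std::vector<float> &b,
                     uint32_t i, uint32_t n) {
  Result r{0.0f, 0};
  for (uint32_t j = 0; j < n; ++j) {
    ++r.predBranches;
    if (j < a.size() && i - j < b.size()) // unsigned wrap makes i - j huge when j > i
      r.sum += a[j] * b[i - j];
  }
  return r;
}

// Model of VF=1, UF=2 with each predicated load in its own guarded block:
// NumPredicatedOps (2 loads) * UF (2) = 4 branches per interleaved iteration.
Result dotPredicatedInterleaved2(const std::vector<float> &a,
                                 const std::vector<float> &b, uint32_t i,
                                 uint32_t n) {
  Result r{0.0f, 0};
  float sum0 = 0.0f, sum1 = 0.0f;
  const uint32_t nMain = n - n % 2;
  uint32_t j = 0;
  for (; j < nMain; j += 2) {
    const bool p0 = j < a.size() && i - j < b.size();
    const bool p1 = j + 1 < a.size() && i - (j + 1) < b.size();
    float a0 = 0.0f, a1 = 0.0f, b0 = 0.0f, b1 = 0.0f;
    ++r.predBranches; if (p0) a0 = a[j];           // like pred.load.if
    ++r.predBranches; if (p1) a1 = a[j + 1];       // like pred.load.if4
    ++r.predBranches; if (p0) b0 = b[i - j];       // like pred.load.if6
    ++r.predBranches; if (p1) b1 = b[i - (j + 1)]; // like pred.load.if8
    sum0 = p0 ? sum0 + a0 * b0 : sum0; // select keeps the old value when masked off
    sum1 = p1 ? sum1 + a1 * b1 : sum1;
  }
  assert(j == nMain);
  r.sum = sum0 + sum1; // combine the two reduction chains
  for (; j < n; ++j) { // scalar remainder
    ++r.predBranches;
    if (j < a.size() && i - j < b.size())
      r.sum += a[j] * b[i - j];
  }
  return r;
}
```

For an even trip count the interleaved model executes twice as many predicate branches as the scalar loop for the same work, which is consistent with the observation that the unroller is the better place to make this decision.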

This revision is now accepted and ready to land. Feb 2 2022, 8:02 AM

This revision was landed with ongoing or failed builds. Feb 7 2022, 11:34 AM

Closed by commit rGb4c6d1bb3791: [LoopVectorizer] Don't perform interleaving of predicated scalar loops (authored by dmgreen).

This revision was automatically updated to reflect the committed changes.

dmgreen added a commit: rGb4c6d1bb3791: [LoopVectorizer] Don't perform interleaving of predicated scalar loops.

Hi @dmgreen, is it necessary to add ScalarInterleavingRequiresPredication to the guard? I have found some cases that the current patch is meant to keep from being interleaved but that are still interleaved, because AggressivelyInterleaveReductions is enabled on our target.

if (AggressivelyInterleaveReductions && !ScalarInterleavingRequiresPredication) {
  LLVM_DEBUG(dbgs() << "LV: Interleaving to expose ILP.\n");
  return IC;
}
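The control flow being discussed can be modelled outside LLVM as follows. This is a heavily simplified sketch, not the real `selectInterleaveCount`: the struct `InterleaveInputs`, the helper `powerOf2Floor`, the function `selectInterleaveCountModel`, and the `guardAggressivePath` switch are hypothetical, the small-loop branch is collapsed to returning SmallIC, and the final `return 1` fallback is inferred from this thread rather than shown in the diff.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical flattened inputs standing in for the cost model state.
struct InterleaveInputs {
  bool vfIsScalar;                       // VF.isScalar()
  bool hasReductions;                    // HasReductions
  bool anyBlockNeedsPredication;         // some block needs predication
  bool needsRuntimePointerCheck;         // runtime pointer checks are needed
  bool aggressivelyInterleaveReductions; // target opts in to aggressive interleaving
  unsigned loopCost;
  unsigned smallLoopCost;
  unsigned ic;                           // candidate interleave count
};

// Largest power of two that is <= x, for x >= 1.
unsigned powerOf2Floor(unsigned x) {
  unsigned p = 1;
  while (p <= x / 2)
    p *= 2;
  return p;
}

// guardAggressivePath == false models the committed patch;
// guardAggressivePath == true models the extra guard proposed above.
unsigned selectInterleaveCountModel(const InterleaveInputs &in,
                                    bool guardAggressivePath) {
  assert(in.loopCost > 0 && in.ic >= 1);
  if (!in.vfIsScalar && in.hasReductions)
    return in.ic; // vectorized loop with reductions

  const bool scalarNeedsPredication =
      in.vfIsScalar && in.anyBlockNeedsPredication;
  const bool scalarNeedsRuntimeCheck =
      in.vfIsScalar && in.needsRuntimePointerCheck;

  if (!scalarNeedsRuntimeCheck && !scalarNeedsPredication &&
      in.loopCost < in.smallLoopCost) {
    // Simplification: the real small-loop branch has further logic.
    return std::min(in.ic, powerOf2Floor(in.smallLoopCost / in.loopCost));
  }

  if (in.aggressivelyInterleaveReductions &&
      !(guardAggressivePath && scalarNeedsPredication))
    return in.ic; // "Interleaving to expose ILP"

  return 1; // not interleaving
}
```

In this model a predicated scalar loop on a target with aggressive interleaving still gets `ic` unless the extra guard is enabled, which is the behaviour the comment above reports.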

Herald added a project: Restricted Project. Jun 22 2022, 6:19 PM

Hello.

Hmm. I remember that this was important for performance in some cases on AArch64 - but we don't enable AggressivelyInterleaveReductions. If you have examples of scalar interleaving causing performance problems with AggressivelyInterleaveReductions, then it sounds sensible to me to disable it for the if statement you pointed to as well.

In D118566#3605004, @dmgreen wrote:

Hello.

Hmm. I remember that this was important for performance in some cases on AArch64 - but we don't enable AggressivelyInterleaveReductions. If you have examples of scalar interleaving causing performance problems with AggressivelyInterleaveReductions, then it sounds sensible to me to disable it for the if statement you pointed to as well.

OK, thanks! I'll try to extract an example from the benchmark to illustrate my point. By the way, I wonder why the patch does not return 1 after detecting that ScalarInterleavingRequiresPredication is true. Is this patch not intended to filter all predicated scalar loops?

Revision Contents

Path                                                              Size
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp                   15 lines
llvm/test/Transforms/LoopVectorize/AArch64/scalar_interleave.ll   95 lines

Diff 406547

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

[6,130 lines not shown]
@@ unsigned LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,

   // Interleave if we vectorized this loop and there is a reduction that could
   // benefit from interleaving.
   if (VF.isVector() && HasReductions) {
     LLVM_DEBUG(dbgs() << "LV: Interleaving because of reductions.\n");
     return IC;
   }

-  // Note that if we've already vectorized the loop we will have done the
-  // runtime check and so interleaving won't require further checks.
-  bool InterleavingRequiresRuntimePointerCheck =
+  // For any scalar loop that either requires runtime checks or predication we
+  // are better off leaving this to the unroller. Note that if we've already
+  // vectorized the loop we will have done the runtime check and so interleaving
+  // won't require further checks.
+  bool ScalarInterleavingRequiresPredication =
+      (VF.isScalar() && any_of(TheLoop->blocks(), [this](BasicBlock *BB) {
+         return Legal->blockNeedsPredication(BB);
+       }));
+  bool ScalarInterleavingRequiresRuntimePointerCheck =
       (VF.isScalar() && Legal->getRuntimePointerChecking()->Need);

   // We want to interleave small loops in order to reduce the loop overhead and
   // potentially expose ILP opportunities.
   LLVM_DEBUG(dbgs() << "LV: Loop cost is " << LoopCost << '\n'
                     << "LV: IC is " << IC << '\n'
                     << "LV: VF is " << VF << '\n');
   const bool AggressivelyInterleaveReductions =
       TTI.enableAggressiveInterleaving(HasReductions);
-  if (!InterleavingRequiresRuntimePointerCheck && LoopCost < SmallLoopCost) {
+  if (!ScalarInterleavingRequiresRuntimePointerCheck &&
+      !ScalarInterleavingRequiresPredication && LoopCost < SmallLoopCost) {
     // We assume that the cost overhead is 1 and we use the cost model
     // to estimate the cost of the loop and interleave until the cost of the
     // loop overhead is about 5% of the cost of the loop.
     unsigned SmallIC =
         std::min(IC, (unsigned)PowerOf2Floor(SmallLoopCost / LoopCost));

     // Interleave until store/load ports (estimated by max interleave count) are
     // saturated.
[4,621 lines not shown]

llvm/test/Transforms/LoopVectorize/AArch64/scalar_interleave.ll

 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
 ; RUN: opt -loop-vectorize -S -o - < %s | FileCheck %s
+; RUN: opt -loop-vectorize -prefer-predicate-over-epilogue=predicate-else-scalar-epilogue -S -o - < %s | FileCheck %s

 target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
 target triple = "aarch64-arm-none-eabi"

 ; This test is not vectorized on AArch64 due to requiring predicated loads.
 ; It should also not be interleaved as the predicated interleaving will just
 ; create less efficient code.

[31 lines not shown]
 ; CHECK-NEXT: [[CMP27:%.*]] = phi i64 [ 1, [[IF_THEN]] ], [ -1, [[IF_THEN6]] ], [ 1, [[IF_ELSE]] ]
 ; CHECK-NEXT: [[TMP0:%.*]] = add i32 [[SRCBLEN]], [[SRCALEN]]
 ; CHECK-NEXT: [[TMP1:%.*]] = add i32 [[TMP0]], -1
 ; CHECK-NEXT: br label [[FOR_COND14_PREHEADER:%.*]]
 ; CHECK: for.cond14.preheader:
 ; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i32 [ 1, [[IF_END12]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_END:%.*]] ]
 ; CHECK-NEXT: [[I_077:%.*]] = phi i32 [ 0, [[IF_END12]] ], [ [[INC33:%.*]], [[FOR_END]] ]
 ; CHECK-NEXT: [[PDST_ADDR_176:%.*]] = phi half* [ [[PDST_ADDR_0]], [[IF_END12]] ], [ [[PDST_ADDR_2:%.*]], [[FOR_END]] ]
-; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[INDVARS_IV]], 2
-; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_SCEVCHECK:%.*]]
-; CHECK: vector.scevcheck:
-; CHECK-NEXT: [[MUL1:%.*]] = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 1, i32 [[I_077]])
-; CHECK-NEXT: [[MUL_RESULT:%.*]] = extractvalue { i32, i1 } [[MUL1]], 0
-; CHECK-NEXT: [[MUL_OVERFLOW:%.*]] = extractvalue { i32, i1 } [[MUL1]], 1
-; CHECK-NEXT: [[TMP2:%.*]] = sub i32 [[I_077]], [[MUL_RESULT]]
-; CHECK-NEXT: [[TMP3:%.*]] = icmp sgt i32 [[TMP2]], [[I_077]]
-; CHECK-NEXT: [[TMP4:%.*]] = or i1 [[TMP3]], [[MUL_OVERFLOW]]
-; CHECK-NEXT: br i1 [[TMP4]], label [[SCALAR_PH]], label [[VECTOR_PH:%.*]]
-; CHECK: vector.ph:
-; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i32 [[INDVARS_IV]], 2
-; CHECK-NEXT: [[N_VEC:%.*]] = sub i32 [[INDVARS_IV]], [[N_MOD_VF]]
-; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
-; CHECK: vector.body:
-; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[PRED_LOAD_CONTINUE9:%.*]] ]
-; CHECK-NEXT: [[VEC_PHI:%.*]] = phi half [ 0xH0000, [[VECTOR_PH]] ], [ [[PREDPHI:%.*]], [[PRED_LOAD_CONTINUE9]] ]
-; CHECK-NEXT: [[VEC_PHI3:%.*]] = phi half [ 0xH0000, [[VECTOR_PH]] ], [ [[PREDPHI10:%.*]], [[PRED_LOAD_CONTINUE9]] ]
-; CHECK-NEXT: [[INDUCTION:%.*]] = add i32 [[INDEX]], 0
-; CHECK-NEXT: [[INDUCTION2:%.*]] = add i32 [[INDEX]], 1
-; CHECK-NEXT: [[TMP5:%.*]] = sub i32 [[I_077]], [[INDUCTION]]
-; CHECK-NEXT: [[TMP6:%.*]] = sub i32 [[I_077]], [[INDUCTION2]]
-; CHECK-NEXT: [[TMP7:%.*]] = icmp ult i32 [[TMP5]], [[SRCBLEN_ADDR_0]]
-; CHECK-NEXT: [[TMP8:%.*]] = icmp ult i32 [[TMP6]], [[SRCBLEN_ADDR_0]]
-; CHECK-NEXT: [[TMP9:%.*]] = icmp ult i32 [[INDUCTION]], [[SRCALEN_ADDR_0]]
-; CHECK-NEXT: [[TMP10:%.*]] = icmp ult i32 [[INDUCTION2]], [[SRCALEN_ADDR_0]]
-; CHECK-NEXT: [[TMP11:%.*]] = and i1 [[TMP9]], [[TMP7]]
-; CHECK-NEXT: [[TMP12:%.*]] = and i1 [[TMP10]], [[TMP8]]
-; CHECK-NEXT: br i1 [[TMP11]], label [[PRED_LOAD_IF:%.*]], label [[PRED_LOAD_CONTINUE:%.*]]
-; CHECK: pred.load.if:
-; CHECK-NEXT: [[TMP13:%.*]] = zext i32 [[INDUCTION]] to i64
-; CHECK-NEXT: [[TMP14:%.*]] = getelementptr inbounds half, half* [[PIN1_0]], i64 [[TMP13]]
-; CHECK-NEXT: [[TMP15:%.*]] = load half, half* [[TMP14]], align 2
-; CHECK-NEXT: br label [[PRED_LOAD_CONTINUE]]
-; CHECK: pred.load.continue:
-; CHECK-NEXT: [[TMP16:%.*]] = phi half [ poison, [[VECTOR_BODY]] ], [ [[TMP15]], [[PRED_LOAD_IF]] ]
-; CHECK-NEXT: br i1 [[TMP12]], label [[PRED_LOAD_IF4:%.*]], label [[PRED_LOAD_CONTINUE5:%.*]]
-; CHECK: pred.load.if4:
-; CHECK-NEXT: [[TMP17:%.*]] = zext i32 [[INDUCTION2]] to i64
-; CHECK-NEXT: [[TMP18:%.*]] = getelementptr inbounds half, half* [[PIN1_0]], i64 [[TMP17]]
-; CHECK-NEXT: [[TMP19:%.*]] = load half, half* [[TMP18]], align 2
-; CHECK-NEXT: br label [[PRED_LOAD_CONTINUE5]]
-; CHECK: pred.load.continue5:
-; CHECK-NEXT: [[TMP20:%.*]] = phi half [ poison, [[PRED_LOAD_CONTINUE]] ], [ [[TMP19]], [[PRED_LOAD_IF4]] ]
-; CHECK-NEXT: br i1 [[TMP11]], label [[PRED_LOAD_IF6:%.*]], label [[PRED_LOAD_CONTINUE7:%.*]]
-; CHECK: pred.load.if6:
-; CHECK-NEXT: [[TMP21:%.*]] = sub nsw i32 0, [[TMP5]]
-; CHECK-NEXT: [[TMP22:%.*]] = sext i32 [[TMP21]] to i64
-; CHECK-NEXT: [[TMP23:%.*]] = getelementptr inbounds half, half* [[PIN2_0]], i64 [[TMP22]]
-; CHECK-NEXT: [[TMP24:%.*]] = load half, half* [[TMP23]], align 2
-; CHECK-NEXT: br label [[PRED_LOAD_CONTINUE7]]
-; CHECK: pred.load.continue7:
-; CHECK-NEXT: [[TMP25:%.*]] = phi half [ poison, [[PRED_LOAD_CONTINUE5]] ], [ [[TMP24]], [[PRED_LOAD_IF6]] ]
-; CHECK-NEXT: br i1 [[TMP12]], label [[PRED_LOAD_IF8:%.*]], label [[PRED_LOAD_CONTINUE9]]
-; CHECK: pred.load.if8:
-; CHECK-NEXT: [[TMP26:%.*]] = sub nsw i32 0, [[TMP6]]
-; CHECK-NEXT: [[TMP27:%.*]] = sext i32 [[TMP26]] to i64
-; CHECK-NEXT: [[TMP28:%.*]] = getelementptr inbounds half, half* [[PIN2_0]], i64 [[TMP27]]
-; CHECK-NEXT: [[TMP29:%.*]] = load half, half* [[TMP28]], align 2
-; CHECK-NEXT: br label [[PRED_LOAD_CONTINUE9]]
-; CHECK: pred.load.continue9:
-; CHECK-NEXT: [[TMP30:%.*]] = phi half [ poison, [[PRED_LOAD_CONTINUE7]] ], [ [[TMP29]], [[PRED_LOAD_IF8]] ]
-; CHECK-NEXT: [[TMP31:%.*]] = fmul fast half [[TMP25]], [[TMP16]]
-; CHECK-NEXT: [[TMP32:%.*]] = fmul fast half [[TMP30]], [[TMP20]]
-; CHECK-NEXT: [[TMP33:%.*]] = fadd fast half [[TMP31]], [[VEC_PHI]]
-; CHECK-NEXT: [[TMP34:%.*]] = fadd fast half [[TMP32]], [[VEC_PHI3]]
-; CHECK-NEXT: [[TMP35:%.*]] = xor i1 [[TMP11]], true
-; CHECK-NEXT: [[TMP36:%.*]] = xor i1 [[TMP12]], true
-; CHECK-NEXT: [[PREDPHI]] = select i1 [[TMP35]], half [[VEC_PHI]], half [[TMP33]]
-; CHECK-NEXT: [[PREDPHI10]] = select i1 [[TMP36]], half [[VEC_PHI3]], half [[TMP34]]
-; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 2
-; CHECK-NEXT: [[TMP37:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
-; CHECK-NEXT: br i1 [[TMP37]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
-; CHECK: middle.block:
-; CHECK-NEXT: [[BIN_RDX:%.*]] = fadd fast half [[PREDPHI10]], [[PREDPHI]]
-; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i32 [[INDVARS_IV]], [[N_VEC]]
-; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_END]], label [[SCALAR_PH]]
-; CHECK: scalar.ph:
-; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_COND14_PREHEADER]] ], [ 0, [[VECTOR_SCEVCHECK]] ]
-; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi half [ 0xH0000, [[VECTOR_SCEVCHECK]] ], [ 0xH0000, [[FOR_COND14_PREHEADER]] ], [ [[BIN_RDX]], [[MIDDLE_BLOCK]] ]
 ; CHECK-NEXT: br label [[FOR_BODY16:%.*]]
 ; CHECK: for.body16:
-; CHECK-NEXT: [[J_074:%.*]] = phi i32 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INC:%.*]], [[FOR_INC:%.*]] ]
-; CHECK-NEXT: [[SUM_073:%.*]] = phi half [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ], [ [[SUM_1:%.*]], [[FOR_INC]] ]
+; CHECK-NEXT: [[J_074:%.*]] = phi i32 [ 0, [[FOR_COND14_PREHEADER]] ], [ [[INC:%.*]], [[FOR_INC:%.*]] ]
+; CHECK-NEXT: [[SUM_073:%.*]] = phi half [ 0xH0000, [[FOR_COND14_PREHEADER]] ], [ [[SUM_1:%.*]], [[FOR_INC]] ]
 ; CHECK-NEXT: [[SUB17:%.*]] = sub i32 [[I_077]], [[J_074]]
 ; CHECK-NEXT: [[CMP18:%.*]] = icmp ult i32 [[SUB17]], [[SRCBLEN_ADDR_0]]
 ; CHECK-NEXT: [[CMP19:%.*]] = icmp ult i32 [[J_074]], [[SRCALEN_ADDR_0]]
 ; CHECK-NEXT: [[OR_COND:%.*]] = and i1 [[CMP19]], [[CMP18]]
 ; CHECK-NEXT: br i1 [[OR_COND]], label [[IF_THEN20:%.*]], label [[FOR_INC]]
 ; CHECK: if.then20:
 ; CHECK-NEXT: [[IDXPROM:%.*]] = zext i32 [[J_074]] to i64
 ; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds half, half* [[PIN1_0]], i64 [[IDXPROM]]
-; CHECK-NEXT: [[TMP38:%.*]] = load half, half* [[ARRAYIDX]], align 2
+; CHECK-NEXT: [[TMP2:%.*]] = load half, half* [[ARRAYIDX]], align 2
 ; CHECK-NEXT: [[SUB22:%.*]] = sub nsw i32 0, [[SUB17]]
 ; CHECK-NEXT: [[IDXPROM23:%.*]] = sext i32 [[SUB22]] to i64
 ; CHECK-NEXT: [[ARRAYIDX24:%.*]] = getelementptr inbounds half, half* [[PIN2_0]], i64 [[IDXPROM23]]
-; CHECK-NEXT: [[TMP39:%.*]] = load half, half* [[ARRAYIDX24]], align 2
-; CHECK-NEXT: [[MUL:%.*]] = fmul fast half [[TMP39]], [[TMP38]]
+; CHECK-NEXT: [[TMP3:%.*]] = load half, half* [[ARRAYIDX24]], align 2
+; CHECK-NEXT: [[MUL:%.*]] = fmul fast half [[TMP3]], [[TMP2]]
 ; CHECK-NEXT: [[ADD25:%.*]] = fadd fast half [[MUL]], [[SUM_073]]
 ; CHECK-NEXT: br label [[FOR_INC]]
 ; CHECK: for.inc:
 ; CHECK-NEXT: [[SUM_1]] = phi half [ [[ADD25]], [[IF_THEN20]] ], [ [[SUM_073]], [[FOR_BODY16]] ]
 ; CHECK-NEXT: [[INC]] = add nuw i32 [[J_074]], 1
 ; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[INC]], [[INDVARS_IV]]
-; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_END]], label [[FOR_BODY16]], !llvm.loop [[LOOP2:![0-9]+]]
+; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_END]], label [[FOR_BODY16]]
 ; CHECK: for.end:
-; CHECK-NEXT: [[SUM_1_LCSSA:%.*]] = phi half [ [[SUM_1]], [[FOR_INC]] ], [ [[BIN_RDX]], [[MIDDLE_BLOCK]] ]
+; CHECK-NEXT: [[SUM_1_LCSSA:%.*]] = phi half [ [[SUM_1]], [[FOR_INC]] ]
 ; CHECK-NEXT: [[PDST_ADDR_2]] = getelementptr inbounds half, half* [[PDST_ADDR_176]], i64 [[CMP27]]
 ; CHECK-NEXT: store half [[SUM_1_LCSSA]], half* [[PDST_ADDR_176]], align 2
 ; CHECK-NEXT: [[INC33]] = add nuw i32 [[I_077]], 1
 ; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add i32 [[INDVARS_IV]], 1
 ; CHECK-NEXT: [[EXITCOND78_NOT:%.*]] = icmp eq i32 [[INC33]], [[TMP1]]
 ; CHECK-NEXT: br i1 [[EXITCOND78_NOT]], label [[FOR_END34:%.*]], label [[FOR_COND14_PREHEADER]]
 ; CHECK: for.end34:
 ; CHECK-NEXT: ret void
[84 lines not shown]