This is an archive of the discontinued LLVM Phabricator instance.

Differential D80870

[LV] Make sure smallest/widest type sizes are powers-of-2.
ClosedPublic

Authored by fhahn on May 30 2020, 8:16 AM.

Details

Reviewers

Ayal
gilr
rengolin

Commits

rGb446ec56a298: [LV] Make sure the MaxVF is a power-of-2 by rounding down.

Summary

LV currently only supports power of 2 vectorization factors, which has
been made explicit with the assertion added in
840450549c9199150cbdee29acef756c19660ca1.

However, if the widest type is not a power-of-2 the computed maxVF won't
be a power-of-2 either. This patch changes getSmallestAndWidestTypes to
round up to the next power-of-2. This can happen in practice for
x86_fp80, for example. Alternatively we could force the computed max VF
to the next-lowest power-of-2.

Fixes PR46139.
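
To make the arithmetic in the summary concrete, here is a minimal standalone sketch. It is plain C++ rather than LLVM code: `powerOf2Floor`, `rawMaxVF` and `roundedMaxVF` are hypothetical helpers written for illustration, and the 512-bit register width is an assumption chosen to resemble the AVX-512 test added in this revision.

```cpp
#include <cstdint>

// Hypothetical helper: the largest power of 2 that is <= X, or 0 when X is 0.
constexpr uint32_t powerOf2Floor(uint32_t X) {
  uint32_t Result = 0;
  // P becomes 0 once the top bit is shifted out, which ends the loop.
  for (uint32_t P = 1; P != 0 && P <= X; P <<= 1)
    Result = P;
  return Result;
}

constexpr bool isPowerOf2(uint32_t X) { return X != 0 && (X & (X - 1)) == 0; }

// Number of elements of the widest type that fit into one register,
// without any rounding.
constexpr uint32_t rawMaxVF(uint32_t RegisterBits, uint32_t WidestTypeBits) {
  return RegisterBits / WidestTypeBits;
}

// The same quotient, rounded down to a power of 2.
constexpr uint32_t roundedMaxVF(uint32_t RegisterBits, uint32_t WidestTypeBits) {
  return powerOf2Floor(RegisterBits / WidestTypeBits);
}

// x86_fp80 is 80 bits wide; 512 bits is the assumed register width.
static_assert(rawMaxVF(512, 80) == 6, "six elements fit");
static_assert(!isPowerOf2(rawMaxVF(512, 80)), "6 is not a power of 2");
static_assert(roundedMaxVF(512, 80) == 4, "rounding down yields 4");
```

Under these assumptions the unrounded quotient is 6, which is the kind of value that would trip a power-of-2 assertion, while rounding the quotient down gives 4.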

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

fhahn created this revision. May 30 2020, 8:16 AM

Herald added a project: Restricted Project. May 30 2020, 8:17 AM

Herald added subscribers: rkruppe, hiraditya.

Harbormaster completed remote builds in B58544: Diff 267459. May 30 2020, 8:59 AM

> Alternatively we could force the computed max VF to the next-lowest power-of-2

This alternative may perhaps be better, or another one, as noted inline.

Also added a couple of unrelated comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5011	The original motivation to assert that MaxVF is a power of 2, in D80491, was for concluding that any chosen VF will divide a constant trip count. Perhaps a better fix than D80491 would be (have been ..., as noted in the summary) to round MaxVF down to a power of 2 here, rather than forcing computeFeasibleMaxVF() and/or getSmallestAndWidestTypes() to return powers of 2.
5077	Unrelated: makes sense to limit MaxVF to ConstTripCount also if the latter is not a power of 2, by rounding it down to a power of 2 (or up, with fold-tail?).
5117	Unrelated: getMinimumVF() must also be (asserted to be) a power of 2; moreover, it must not cause MaxVF to exceed `MaxSafeRegisterWidth`.
5228	The critical part is to make sure VF does not exceed `MaxSafeRegisterWidth`; this indeed holds when rounding `MinWidth` and `MaxWidth` up to powers of 2; might not hold when rounding them down, conceptually. But perhaps non-power-of-2 sized types should be skipped when computing MinWidth and MaxWidth, related to the last "also" part of the FIXME above? cf. `hasIrregularType()`.
llvm/test/Transforms/LoopVectorize/X86/fp80-widest-type.ll
10	Worth indicating that this fixes PR46139; e.g., in a comment, test name, file name.
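
The trip-count argument in the first inline comment above can be checked with a small standalone sketch. This is plain C++, not LLVM code; `noTailForAnyVF` is a hypothetical name, and the sketch assumes that the candidate VFs are exactly the powers of 2 up to MaxVF.

```cpp
#include <cstdint>

// Returns true if every power-of-2 VF in [1, MaxVF] divides TripCount,
// i.e. no scalar tail remains whichever of those VFs is chosen.
constexpr bool noTailForAnyVF(uint32_t TripCount, uint32_t MaxVF) {
  // VF becomes 0 once the top bit is shifted out, which ends the loop.
  for (uint32_t VF = 1; VF != 0 && VF <= MaxVF; VF <<= 1)
    if (TripCount % VF != 0)
      return false;
  return true;
}

// MaxVF = 8 divides TC = 24, and so do the candidate VFs 1, 2, 4 and 8.
static_assert(24 % 8 == 0 && noTailForAnyVF(24, 8), "power-of-2 MaxVF");
// MaxVF = 6 divides TC = 6, yet VF = 4 would leave a tail of 2 iterations.
static_assert(6 % 6 == 0 && !noTailForAnyVF(6, 6), "non-power-of-2 MaxVF");
```

When MaxVF is a power of 2, checking `TC % MaxVF == 0` covers every smaller power-of-2 VF as well; with MaxVF = 6 the same check passes but does not carry over to VF = 4.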

fhahn mentioned this in D80491: [LV] Clamp MaxVF to power of 2. Jun 1 2020, 7:44 AM

bjope added a subscriber: bjope. Jun 1 2020, 8:27 AM

bjope removed a subscriber: bjope. Jun 1 2020, 10:39 AM

bjope added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5011	Rounding down MaxVF returned by computeFeasibleMaxVF (maybe at the end of that function rather than here) would probably work better for our OOT target (with 160-bit register and i40 types, so neither the Smallest/WidestType nor the TTI.getRegisterBitWidth(true) returns a power-of-2 value in that situation).
5228	I did play around with ignoring "irregular" types here a few days ago. In most situations I think it is ok. But I also noticed that it is possible to for example get <4 x i15> in the vector body, with a loop doing `store i15 7, i15* ptr, align 2`. So ignoring the type when determining Smallest/Widest type does not completely prevent us from creating vectors with irregular types (as it seems). And that vector could be larger than the max register width. Although, considering that the resulting <4 x i15> vector store in my test will use a packed layout, that case actually looks like a bug (I'll probably end up writing a PR about it).
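
To compare the rounding options discussed in these comments, the following standalone sketch evaluates them for the out-of-tree configuration mentioned above: a 160-bit register with 40-bit elements. It is plain C++ with simplified, hypothetical helpers (`floorPow2`, `ceilPow2` and the three `maxVF...` functions are illustrative names, not LLVM APIs).

```cpp
#include <cstdint>

// Hypothetical helper: largest power of 2 that is <= X, or 0 when X is 0.
constexpr uint32_t floorPow2(uint32_t X) {
  uint32_t Result = 0;
  for (uint32_t P = 1; P != 0 && P <= X; P <<= 1)
    Result = P;
  return Result;
}

// Hypothetical helper: smallest power of 2 that is >= X (1 for X == 0).
// Returns 0 if no such power fits into 32 bits.
constexpr uint32_t ceilPow2(uint32_t X) {
  uint32_t P = 1;
  while (P != 0 && P < X)
    P <<= 1;
  return P;
}

// Option 1: round only the register width down, then divide.
constexpr uint32_t maxVFRoundRegister(uint32_t Reg, uint32_t Ty) {
  return floorPow2(Reg) / Ty;
}

// Option 2: round the register down and the type size up, then divide.
constexpr uint32_t maxVFRoundTypeUp(uint32_t Reg, uint32_t Ty) {
  return floorPow2(Reg) / ceilPow2(Ty);
}

// Option 3: divide first, then round the quotient down.
constexpr uint32_t maxVFRoundQuotient(uint32_t Reg, uint32_t Ty) {
  return floorPow2(Reg / Ty);
}

static_assert(maxVFRoundRegister(160, 40) == 3, "128 / 40: not a power of 2");
static_assert(maxVFRoundTypeUp(160, 40) == 2, "128 / 64: half the register");
static_assert(maxVFRoundQuotient(160, 40) == 4, "floor(160 / 40): full register");
```

Under these assumptions only the last variant both yields a power of 2 and fills the whole 160-bit register, which is consistent with the remark that rounding the final quotient down would work better for such a target.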

D80491 has been breaking our ToT builds for a week now. Can this be submitted soon? If not, please consider reverting D80491.

Updated to round down in computeFeasibleMaxVF.

In D80870#2065079, @Ayal wrote:

> Also added a couple of unrelated comments.

I'll take a look at those in a bit, when I have more time.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5011	> Perhaps a better fix than D80491 would be (have been ..., as noted in the summary) to round MaxVF down to a power of 2 here, rather than forcing computeFeasibleMaxVF() and/or getSmallestAndWidestTypes() to return powers of 2.
`computeFeasibleMaxVF` already has a comment indicating that the MaxVF should be a power-of-2, I think. I updated the patch to move the rounding down to computeFeasibleMaxVF, as other code might also implicitly rely on MaxVF being a power-of-2. It might be good to first ensure MaxVF is a power of 2 in `computeFeasibleMaxVF` in the short term and then check whether the other uses work as expected with non-powers-of-2 subsequently, if required/desired. I think on most in-tree targets it is unlikely to have much impact, but it would be interesting to hear if the out-of-tree target @bjope mentioned actually supports non-power-of-2 vectorization factors (e.g. vector add of `5 x i40`)?

In D80870#2066895, @manojgupta wrote:

> D80491 has been breaking our ToT builds for a week now. Can this be submitted soon? If not, please consider reverting D80491.

Sorry about this. But note that reverting D80491 will only remove an assert, leading to potentially wrong code being generated silently.
Perhaps additional lit tests could be devised to help exercise failing behaviors(?)

In D80870#2067120, @fhahn wrote:

> Updated to round down in computeFeasibleMaxVF.

Update looks good to me, thanks! Added a few minor comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5061	Suggest to add to the comment that type sizes may also not be powers of 2.
5061–5062	Rounding WidestRegister down to a power of 2 is no longer needed; rounding-down MaxVectorSize alone suffices to ensure the method returns a power of 2.
5121	This should be redundant; i.e., at this point MaxVF can be asserted to be a power of 2. Only way it might not be is because of getMinimumVF(), which, being a minimum, should probably not be rounded down silently (see above comment).
llvm/test/Transforms/LoopVectorize/X86/fp80-widest-type.ll
8	Comment should be updated. E.g., ";Make sure non-power-of-2 types are handled correctly, i.e., MaxVF is still a power-of-2."

Adjust comments, drop unnecessary rounding.

fhahn marked 6 inline comments as done. Jun 1 2020, 1:54 PM

fhahn added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5061–5062	Right, the only other use to compute the loop bound below should be fine.
5121	> This should be redundant; i.e., at this point MaxVF can be asserted to be a power of 2. Only way it might not be is because of getMinimumVF(), which, being a minimum, should probably not be rounded down silently (see above comment).
Right. Otherwise I am sure some test case will surface, which can then be added :)

Thanks!

This revision is now accepted and ready to land. Jun 1 2020, 2:21 PM

Harbormaster completed remote builds in B58660: Diff 267703. Jun 1 2020, 3:43 PM

Harbormaster completed remote builds in B58669: Diff 267724. Jun 1 2020, 4:16 PM

DaniilSuchkov added a subscriber: DaniilSuchkov. Jun 2 2020, 12:28 AM

Closed by commit rGb446ec56a298: [LV] Make sure the MaxVF is a power-of-2 by rounding down. (authored by fhahn). Jun 2 2020, 2:46 AM

This revision was automatically updated to reflect the committed changes.

fhahn marked 2 inline comments as done.

bjope added inline comments. Jun 3 2020, 6:13 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5011	The VF is power-of-2 for the targets we support (but we have some registers and element sizes that aren't power-of-2). And this patch (as it landed) seems to solve the problems we had with D80491. Thanks!

Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp	10 lines
llvm/test/Transforms/LoopVectorize/X86/fp80-widest-type.ll	40 lines

Diff 267831

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

[1,317 lines not shown] void invalidateCostModelingDecisions() {
      WideningDecisions.clear();
      Uniforms.clear();
      Scalars.clear();
    }

  private:
    unsigned NumPredStores = 0;

-   /// \return An upper bound for the vectorization factor, larger than zero.
-   /// One is returned if vectorization should best be avoided due to cost.
+   /// \return An upper bound for the vectorization factor, a power-of-2 larger
+   /// than zero. One is returned if vectorization should best be avoided due
+   /// to cost.
    unsigned computeFeasibleMaxVF(unsigned ConstTripCount);

    /// The vectorization cost is a combination of the cost itself and a boolean
    /// indicating whether any of the contributing operations will actually
    /// operate on
    /// vector values after type legalization in the backend. If this latter value
    /// is
    /// false, then all operations will be scalarized (i.e. no vectorization has
[3,666 lines not shown] Optional<unsigned> LoopVectorizationCostModel::computeMaxVF(unsigned UserVF,
    if (!useMaskedInterleavedAccesses(TTI)) {
      assert(WideningDecisions.empty() && Uniforms.empty() && Scalars.empty() &&
             "No decisions should have been taken at this point");
      // Note: There is no need to invalidate any cost modeling decisions here, as
      // non where taken so far.
      InterleaveInfo.invalidateGroupsRequiringScalarEpilogue();
    }

    unsigned MaxVF = UserVF ? UserVF : computeFeasibleMaxVF(TC);
Ayal (Not Done): The original motivation to assert that MaxVF is a power of 2, in D80491, was for concluding…
bjope (Not Done): Rounding down MaxVF returned by computeFeasibleMaxVF (maybe at the end of that function rather…
fhahn (Done): > Perhaps a better fix than D80491 would be (have been ..., as noted in the summary) to round…
bjope (Not Done): The VF is power-of-2 for the targets we support (but we got some registers and element sizes…
    assert((UserVF || isPowerOf2_32(MaxVF)) && "MaxVF must be a power of 2");
    unsigned MaxVFtimesIC = UserIC ? MaxVF * UserIC : MaxVF;
    if (TC > 0 && TC % MaxVFtimesIC == 0) {
      // Accept MaxVF if we do not have a tail.
      LLVM_DEBUG(dbgs() << "LV: No tail will remain for any chosen VF.\n");
      return MaxVF;
    }

[33 lines not shown] LoopVectorizationCostModel::computeFeasibleMaxVF(unsigned ConstTripCount) {
    // Get the maximum safe dependence distance in bits computed by LAA.
    // It is computed by MaxVF * sizeOf(type) * 8, where type is taken from
    // the memory accesses that is most restrictive (involved in the smallest
    // dependence distance).
    unsigned MaxSafeRegisterWidth = Legal->getMaxSafeRegisterWidth();

    WidestRegister = std::min(WidestRegister, MaxSafeRegisterWidth);

    // Ensure MaxVF is a power of 2; the dependence distance bound may not be.
-   WidestRegister = PowerOf2Floor(WidestRegister);
-   unsigned MaxVectorSize = WidestRegister / WidestType;
+   // Note that both WidestRegister and WidestType may not be a powers of 2.
+   unsigned MaxVectorSize = PowerOf2Floor(WidestRegister / WidestType);
Ayal (Done): Suggest to add to the comment that type sizes may also not be powers of 2.
Ayal (Done): Rounding WidestRegister down to a power of 2 is no longer needed; rounding-down MaxVectorSize…
fhahn (Done): Right, the only other use to compute the loop bound below should be fine.

    LLVM_DEBUG(dbgs() << "LV: The Smallest and Widest types: " << SmallestType
                      << " / " << WidestType << " bits.\n");
    LLVM_DEBUG(dbgs() << "LV: The Widest register safe to use is: "
                      << WidestRegister << " bits.\n");

    assert(MaxVectorSize <= 256 && "Did not expect to pack so many elements"
                                   " into one vector!");
    if (MaxVectorSize == 0) {
      LLVM_DEBUG(dbgs() << "LV: The target has no vector registers.\n");
      MaxVectorSize = 1;
      return MaxVectorSize;
    } else if (ConstTripCount && ConstTripCount < MaxVectorSize &&
               isPowerOf2_32(ConstTripCount)) {
Ayal (Not Done): Unrelated: makes sense to limit MaxVF to ConstTripCount also if the latter is not a power of 2…
      // We need to clamp the VF to be the ConstTripCount. There is no point in
      // choosing a higher viable VF as done in the loop below.
      LLVM_DEBUG(dbgs() << "LV: Clamping the MaxVF to the constant trip count: "
                        << ConstTripCount << "\n");
      MaxVectorSize = ConstTripCount;
      return MaxVectorSize;
    }

[23 lines not shown] for (int i = RUs.size() - 1; i >= 0; --i) {
          MaxVF = VFs[i];
          break;
        }
      }
      if (unsigned MinVF = TTI.getMinimumVF(SmallestType)) {
        if (MaxVF < MinVF) {
          LLVM_DEBUG(dbgs() << "LV: Overriding calculated MaxVF(" << MaxVF
                            << ") with target's minimum: " << MinVF << '\n');
          MaxVF = MinVF;
Ayal (Not Done): Unrelated: getMinimumVF() must also be (asserted to be) a power of 2; moreover, it must not…
        }
      }
    }
    return MaxVF;
Ayal (Done): This should be redundant; i.e., at this point MaxVF can be asserted to be a power of 2. Only…
fhahn (Done): > This should be redundant; i.e., at this point MaxVF can be asserted to be a power of 2. Only…
  }

  VectorizationFactor
  LoopVectorizationCostModel::selectVectorizationFactor(unsigned MaxVF) {
    float Cost = expectedCost(1).first;
    const float ScalarCost = Cost;
    unsigned Width = 1;
    LLVM_DEBUG(dbgs() << "LV: Scalar loop costs: " << (int)ScalarCost << ".\n");
[90 lines not shown] for (Instruction &I : BB->instructionsWithoutDebug()) {

        MinWidth = std::min(MinWidth,
                            (unsigned)DL.getTypeSizeInBits(T->getScalarType()));
        MaxWidth = std::max(MaxWidth,
                            (unsigned)DL.getTypeSizeInBits(T->getScalarType()));
      }
    }

    return {MinWidth, MaxWidth};
Ayal (Not Done): The critical part is to make sure VF does not exceed `MaxSafeRegisterWidth`; this indeed holds…
bjope (Not Done): I did play around with ignoring "irregular" types here a few days ago. In most situations I…
  }

  unsigned LoopVectorizationCostModel::selectInterleaveCount(unsigned VF,
                                                             unsigned LoopCost) {
    // -- The interleave heuristics --
    // We interleave the loop in order to expose ILP and reduce the loop overhead.
    // There are many micro-architectural considerations that we can't predict
    // at this level. For example, frontend pressure (on decode or fetch) due to
[2,867 lines not shown]

llvm/test/Transforms/LoopVectorize/X86/fp80-widest-type.ll

This file was added.

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt -loop-vectorize -S %s -mattr=+avx512f | FileCheck %s

target datalayout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.15.0"

; Make sure non-power-of-2 types are handled correctly, i.e., MaxVF is still a power-of-2.
Ayal (Done): Comment should be updated. E.g., ";Make sure non-power-of-2 types are handled correctly, i.e.
; Test case from PR46139.
Ayal (Not Done): Worth indicating that this fixes PR46139; e.g., in a comment, test name, file name.

define x86_fp80 @test() {
; CHECK-LABEL: @test(
; CHECK-NEXT: foo.exit:
; CHECK-NEXT: br label [[FOR_BODY3_I_3:%.*]]
; CHECK: for.body3.i.3:
; CHECK-NEXT: [[N_ADDR_112_I_3:%.*]] = phi i64 [ [[DEC_I_3:%.*]], [[FOR_BODY3_I_3]] ], [ 24, [[FOO_EXIT:%.*]] ]
; CHECK-NEXT: [[X_ADDR_111_I_3:%.*]] = phi x86_fp80 [ [[MUL_I_3:%.*]], [[FOR_BODY3_I_3]] ], [ undef, [[FOO_EXIT]] ]
; CHECK-NEXT: [[MUL_I_3]] = fmul x86_fp80 [[X_ADDR_111_I_3]], 0xK40008000000000000000
; CHECK-NEXT: [[DEC_I_3]] = add nsw i64 [[N_ADDR_112_I_3]], -1
; CHECK-NEXT: [[CMP2_I_3:%.*]] = icmp sgt i64 [[N_ADDR_112_I_3]], 1
; CHECK-NEXT: br i1 [[CMP2_I_3]], label [[FOR_BODY3_I_3]], label [[FOO_EXIT_3:%.*]]
; CHECK: foo.exit.3:
; CHECK-NEXT: [[MUL_LCSSA:%.*]] = phi x86_fp80 [ [[MUL_I_3]], [[FOR_BODY3_I_3]] ]
; CHECK-NEXT: ret x86_fp80 [[MUL_LCSSA]]
;
foo.exit:
  br label %for.body3.i.3

for.body3.i.3: ; preds = %for.body3.i.3, %foo.exit
  %n.addr.112.i.3 = phi i64 [ %dec.i.3, %for.body3.i.3 ], [ 24, %foo.exit ]
  %x.addr.111.i.3 = phi x86_fp80 [ %mul.i.3, %for.body3.i.3 ], [ undef, %foo.exit ]
  %mul.i.3 = fmul x86_fp80 %x.addr.111.i.3, 0xK40008000000000000000
  %dec.i.3 = add nsw i64 %n.addr.112.i.3, -1
  %cmp2.i.3 = icmp sgt i64 %n.addr.112.i.3, 1
  br i1 %cmp2.i.3, label %for.body3.i.3, label %foo.exit.3

foo.exit.3: ; preds = %for.body3.i.3
  %mul.lcssa = phi x86_fp80 [ %mul.i.3, %for.body3.i.3 ]
  ret x86_fp80 %mul.lcssa
}