
Details

Reviewers

hsaito
Ayal
fhahn
reames

Commits

rG18824d25d8aa: [LV] Interleaving should not exceed estimated loop trip count.

Summary

Currently we may interleave by more than the estimated trip count
coming from the profile or from the computed maximum trip count. The solution is to
use the "best known" trip count instead of the exact one in the interleaving analysis.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

ebrevnov created this revision. Sep 23 2019, 11:17 PM

Herald added a project: Restricted Project. Sep 23 2019, 11:17 PM

Herald added subscribers: llvm-commits, rkruppe, hiraditya.

Harbormaster completed remote builds in B38462: Diff 221470. Sep 23 2019, 11:17 PM

ebrevnov added a parent revision: D67690: [LV][NFC] Factor out calculation of "best" estimated trip count. Sep 23 2019, 11:25 PM

ebrevnov added reviewers: hsaito, Ayal, fhahn, reames.

ebrevnov marked an inline comment as done. Sep 24 2019, 4:41 AM

ebrevnov added inline comments.

llvm/test/Transforms/LoopVectorize/X86/no_fpmath_with_hotness.ll
6	According to the profile, the original loop has 99 iterations, so interleaving is disabled by the short trip count heuristic controlled by tiny-trip-count-interleave-threshold.
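
To illustrate the heuristic this comment refers to, here is a minimal standalone sketch rather than the LLVM code itself. The names skipInterleaving and kTinyTripCountInterleaveThreshold are hypothetical, and std::optional stands in for whatever optional type the real code uses; the default of 128 and the shape of the comparison are taken from the diff in this revision.

```cpp
#include <optional>

// Hypothetical stand-in for the tiny-trip-count-interleave-threshold option,
// which this revision initializes to 128.
constexpr unsigned kTinyTripCountInterleaveThreshold = 128;

// Returns true when interleaving should be skipped because the best known
// (exact or estimated) trip count is below the threshold. An empty optional
// models "no trip count information available", in which case the heuristic
// does not fire.
bool skipInterleaving(std::optional<unsigned> BestKnownTC,
                      unsigned Threshold = kTinyTripCountInterleaveThreshold) {
  return BestKnownTC && *BestKnownTC < Threshold;
}
```

With an estimated trip count of 99, as in this test, skipInterleaving(99u) returns true, which would be consistent with the expected remark changing from an interleaved count of 2 to 1.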

rscottmanley added a subscriber: rscottmanley. Sep 24 2019, 6:12 AM

This looks reasonable to me, but I'm not a qualified reviewer in this area.

llvm/test/Transforms/LoopVectorize/interleave_short_tc.ll
1 ↗	(On Diff #221470)	I would suggest using autogenerated checks for these. It'll be more verbose, but also clearer about the expected output. See utils/update_test_checks.py and FileCheck's --check-prefix option for the three variations.
65 ↗	(On Diff #221470)	Please remove unnecessary aspects of test.

ebrevnov marked an inline comment as done. Sep 26 2019, 9:39 PM

ebrevnov added inline comments.

llvm/test/Transforms/LoopVectorize/interleave_short_tc.ll
1 ↗	(On Diff #221470)	"See utils/update_test_checks.py and FileCheck's --check-prefix option for the three variations."
	I'm not sure I understand exactly what you mean by this. I know it is possible to have a dedicated prefix for the checks common to two run lines and to pass several prefixes to FileCheck. For example:
	; RUN: opt <cmd1> | FileCheck --check-prefix=CHECK-COMMON,CHECK-SPECIFIC1
	; RUN: opt <cmd2> | FileCheck --check-prefix=CHECK-COMMON,CHECK-SPECIFIC2
	In my case I have three different runs with one common check. Is there a way to combine them?

Minor test update

ebrevnov marked 2 inline comments as done and an inline comment as not done. Sep 27 2019, 3:36 AM

ebrevnov added inline comments.

llvm/test/Transforms/LoopVectorize/interleave_short_tc.ll
1 ↗	(On Diff #221470)	"I would suggest using autogenerated checks for these. It'll be more verbose, but also clearer about the expected output."
	In fact I did use update_test_checks.py to generate the initial checks and then manually removed all the unrelated ones. I think that makes the test less sensitive to unrelated changes, and it's easy to see what we actually care about. On the other hand, I do see the benefit of having wider context without the need to run 'opt' by hand. It would be interesting to hear what others think in this regard.
65 ↗	(On Diff #221470)	Done

xbolva00 added a subscriber: xbolva00. Sep 27 2019, 3:38 AM

xbolva00 added inline comments.

llvm/test/Transforms/LoopVectorize/interleave_short_tc.ll
65 ↗	(On Diff #221470)	+1 Just leave metadata which you really need for this test. (Lines 56-70 can be removed)

The vectorizer code change looks fine to me. I'd like to see the comments updated, though. Are any more changes needed for the LIT tests?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5146–5147	small estimated or constant trip count
5212–5213	If trip count is expected to be small, limit the interleave count to be less than the trip count divided by VF

hsaito added inline comments. Oct 4 2019, 4:40 PM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5147–5150	This assumes constant trip count case is handled well by getSmallConstantMaxTripCount called from getSmallBestKnownTC ---- but if that is not the case, that would be a bug on the SCEV side. I traced a little bit but could not verify it myself as I'm not familiar with SCEV code. As such, I'm just pointing out a different SCEV function will be called as a result of this change.

In D67948#1695326, @hsaito wrote:

The vectorizer code change looks fine to me. I'd like to see the comments updated, though. Are any more changes needed for the LIT tests?

LGTM. Please wait a few more days to give others a chance for another look.

This revision is now accepted and ready to land. Oct 7 2019, 11:32 AM

Minor changes requested by reviewers

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5146–5147	Done
5147–5150	Maybe I'm missing your concern but constant trip count case is processed first in getSmallBestKnownTC by a call to getSmallConstantTripCount. Thus I don't see any change for constant trip count case at all.
5212–5213	There is some ambiguity in the use of "small" throughout the code. For getSmallBestKnownTC, "small" means that the value fits in 32 bits. For the check "if (BestKnownTC && *BestKnownTC < TinyTripCountInterleaveThreshold)", "small" means less than TinyTripCountInterleaveThreshold. Here "small" would refer to the meaning defined by getSmallBestKnownTC. I think we had better avoid using "small" one more time here, to minimize the confusion.
llvm/test/Transforms/LoopVectorize/interleave_short_tc.ll
65 ↗	(On Diff #221470)	I tried to remove as much as I can but not all of them can be actually removed.
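
The lookup order discussed in the comments above could be modeled roughly as follows. This is a standalone sketch, not the body of getSmallBestKnownTC: TripCountInfo and bestKnownTripCount are hypothetical names, zero is used as an assumed "unknown" marker, and placing the profile estimate ahead of the constant maximum is an assumption. The thread itself only confirms that the exact constant trip count is consulted first and that the profile estimate and the computed maximum are the other sources.

```cpp
#include <optional>

// Hypothetical bundle of the three trip count sources mentioned in this
// review. In the real pass these would come from SCEV and profile metadata.
struct TripCountInfo {
  unsigned ExactConstantTC = 0;       // assumed: 0 means "not a known constant"
  std::optional<unsigned> ProfileTC;  // estimate derived from profile data
  unsigned ConstantMaxTC = 0;         // assumed: 0 means "no known upper bound"
};

// Returns the "best known" trip count, or an empty optional if nothing is
// known. The exact constant wins; the order of the two fallbacks is an
// assumption made for this sketch.
std::optional<unsigned> bestKnownTripCount(const TripCountInfo &Info) {
  if (Info.ExactConstantTC != 0)
    return Info.ExactConstantTC;
  if (Info.ProfileTC)
    return Info.ProfileTC;
  if (Info.ConstantMaxTC != 0)
    return Info.ConstantMaxTC;
  return std::nullopt;
}
```

Under this model, a loop with an exact constant trip count takes the first branch regardless of profile data, which matches the point that nothing changes for the constant trip count case.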

@hsaito, Please commit if you find this version acceptable.

In D67948#1710615, @ebrevnov wrote:

@hsaito, Please commit if you find this version acceptable.

@ebrevnov, I recommend you visit http://llvm.org/docs/DeveloperPolicy.html#obtaining-commit-access and obtain your own commit access. Your committed and under review patches deserve it.
Thank you very much for your contribution. Let me know if you still would like me to commit this one.

hsaito added inline comments. Oct 16 2019, 10:13 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5212–5213	Sorry for being unclear. I was suggesting an update to the comment. With this patch, BestKnownTC is not constant, right?

In D67948#1711380, @hsaito wrote:

@ebrevnov, I recommend you visit http://llvm.org/docs/DeveloperPolicy.html#obtaining-commit-access and obtain your own commit access. Your committed and under review patches deserve it.
Thank you very much for your contribution. Let me know if you still would like me to commit this one.

I was going to do that right after this patch lands. Please assist me (hopefully last time) in landing the patch.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5212–5213	I see what you are talking about. Another ambiguity here :-) BestKnownTC is a compile-time constant value, which may be an exact run-time constant or an estimate of a non-constant trip count. In this case I believe "constant" means that we were able to get a compile-time constant value for the trip count. How about the following wording? // If trip count is known or estimated compile time constant, limit ....
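
The clamp that this comment documents can be sketched in isolation as below. clampInterleaveCount is a hypothetical helper and std::optional stands in for whatever optional type the real code uses; the std::min expression mirrors the line added in this revision. Because of integer division, the result is zero when the trip count is smaller than VF. The sketch does not guard against that, and whether the real code relies on the earlier threshold check to rule it out is not verified here.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>

// Limits the interleave count so that it does not exceed the best known trip
// count divided by the vectorization factor VF. When no trip count is known
// (empty optional), the incoming maximum is returned unchanged.
unsigned clampInterleaveCount(std::optional<unsigned> BestKnownTC, unsigned VF,
                              unsigned MaxInterleaveCount) {
  assert(VF > 0 && "VF must be non-zero to divide by it");
  if (BestKnownTC)
    MaxInterleaveCount = std::min(*BestKnownTC / VF, MaxInterleaveCount);
  return MaxInterleaveCount;
}
```

For example, with a best known trip count of 200, VF of 64, and an incoming maximum of 8, the helper returns 3, since 200 / 64 is 3 in integer arithmetic.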

In D67948#1712213, @ebrevnov wrote:

I was going to do that right after this patch lands. Please assist me (hopefully last time) in landing the patch.

OK. Please update the comment and then I'll land the patch.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5212–5213	That's fine as well.

Updated comment as agreed.

In D67948#1717671, @ebrevnov wrote:

Updated comment as agreed.

Got it. First time committing through git (SVN is read-only now). Expecting some learning curve. If you'd like this in sooner, you might want to ask someone familiar with the process. Will see if I hit any issues.

In D67948#1722099, @hsaito wrote:

Got it. First time committing through git (SVN is read-only now). Expecting some learning curve. If you'd like this in sooner, you might want to ask someone familiar with the process. Will see if I hit any issues.

Needed to have done this process before SVN went readonly. Please ask @reames to commit. He must be better prepared. Sorry about that.
https://llvm.org/docs/DeveloperPolicy.html#obtaining-commit-access-to-the-github-repository

In D67948#1722120, @hsaito wrote:

Needed to have done this process before SVN went readonly. Please ask @reames to commit. He must be better prepared. Sorry about that.
https://llvm.org/docs/DeveloperPolicy.html#obtaining-commit-access-to-the-github-repository

@craig.topper agreed to commit this. Should happen today. Sorry for the mess.

Closed by commit rG18824d25d8aa: [LV] Interleaving should not exceed estimated loop trip count. (authored by craig.topper). Oct 28 2019, 10:59 AM

This revision was automatically updated to reflect the committed changes.

@hsaito & @craig.topper appreciated a lot!

craig.topper mentioned this in rGf8ba90d448c6: [LV] Add test case that was supposed to go with D67948. Oct 31 2019, 3:21 PM

I failed to commit interleave_short_tc.ll initially. I fixed that yesterday, but then had to move it into the X86 test directory to appease the build bots.

We have observed some performance regressions, presumably because the vectorized code started to kick in on loops with a short estimated trip count (as opposed to skipping the vector code and executing the scalar code). We'll try following up with cost model tuning. I wouldn't be too surprised if others also hit a similar issue. Overall, though, that's the right direction to head in.

snidertm added a subscriber: snidertm. Nov 7 2019, 6:50 AM

AaronLiu added a subscriber: AaronLiu. Jun 8 2020, 11:20 PM

This comment was removed by AaronLiu.

I accidentally removed the message that I posted above. Re-posting here: basically, what I want to say is that I'd like to ask the reviewers of this patch to review another patch, D81416, that touches the same file. Thanks!

In D67948#1731021, @hsaito wrote:

We have observed some performance regressions, presumably because the vectorized code started to kick in on loops with a short estimated trip count (as opposed to skipping the vector code and executing the scalar code). We'll try following up with cost model tuning. I wouldn't be too surprised if others also hit a similar issue. Overall, though, that's the right direction to head in.

This was fixed?

In D67948#2083216, @xbolva00 wrote:

This was fixed?

From my testing, one of the benchmarks degraded by 20+% at peak last year, but with the patch in D81416 it is now confirmed to get a 50+% performance gain.

Wow, significant improvement!

And please post these benchmark results in your patch too.

I tested it on PowerPC, and I request that the community test it on other platforms. It is currently disabled by default; enabling it will be the next step.

Diff 226707

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ static cl::opt<bool> EnableInterleavedMemAccesses(
     cl::desc("Enable vectorization on interleaved memory accesses in a loop"));

 /// An interleave-group may need masking if it resides in a block that needs
 /// predication, or in order to mask away gaps.
 static cl::opt<bool> EnableMaskedInterleavedMemAccesses(
     "enable-masked-interleaved-mem-accesses", cl::init(false), cl::Hidden,
     cl::desc("Enable vectorization on masked interleaved memory accesses in a loop"));

-/// We don't interleave loops with a known constant trip count below this
-/// number.
-static const unsigned TinyTripCountInterleaveThreshold = 128;
+static cl::opt<unsigned> TinyTripCountInterleaveThreshold(
+    "tiny-trip-count-interleave-threshold", cl::init(128), cl::Hidden,
+    cl::desc("We don't interleave loops with a estimated constant trip count "
+             "below this number"));

 static cl::opt<unsigned> ForceTargetNumScalarRegs(
     "force-target-num-scalar-regs", cl::init(0), cl::Hidden,
     cl::desc("A flag that overrides the target's number of scalar registers."));

 static cl::opt<unsigned> ForceTargetNumVectorRegs(
     "force-target-num-vector-regs", cl::init(0), cl::Hidden,
     cl::desc("A flag that overrides the target's number of vector registers."));

@@ unsigned LoopVectorizationCostModel::selectInterleaveCount(unsigned VF,
   // due to the increased register pressure.

   if (!isScalarEpilogueAllowed())
     return 1;

   // We used the distance for the interleave count.
   if (Legal->getMaxSafeDepDistBytes() != -1U)
     return 1;

-  // Do not interleave loops with a relatively small trip count.
-  unsigned TC = PSE.getSE()->getSmallConstantTripCount(TheLoop);
-  if (TC > 1 && TC < TinyTripCountInterleaveThreshold)
+  // Do not interleave loops with a relatively small known or estimated trip
+  // count.
+  auto BestKnownTC = getSmallBestKnownTC(*PSE.getSE(), TheLoop);
+  if (BestKnownTC && *BestKnownTC < TinyTripCountInterleaveThreshold)
     return 1;

   RegisterUsage R = calculateRegisterUsage({VF})[0];
   // We divide by these constants so assume that we have at least one
   // instruction that uses at least one register.
   for (auto& pair : R.MaxLocalUsers) {
     pair.second = std::max(pair.second, 1U);
   }

@@ unsigned LoopVectorizationCostModel::selectInterleaveCount(unsigned VF,
   // Check if the user has overridden the max.
   if (VF == 1) {
     if (ForceTargetMaxScalarInterleaveFactor.getNumOccurrences() > 0)
       MaxInterleaveCount = ForceTargetMaxScalarInterleaveFactor;
   } else {
     if (ForceTargetMaxVectorInterleaveFactor.getNumOccurrences() > 0)
       MaxInterleaveCount = ForceTargetMaxVectorInterleaveFactor;
   }

-  // If the trip count is constant, limit the interleave count to be less than
-  // the trip count divided by VF.
-  if (TC > 0) {
-    assert(TC >= VF && "VF exceeds trip count?");
-    if ((TC / VF) < MaxInterleaveCount)
-      MaxInterleaveCount = (TC / VF);
-  }
+  // If trip count is known or estimated compile time constant, limit the
+  // interleave count to be less than the trip count divided by VF.
+  if (BestKnownTC) {
+    MaxInterleaveCount = std::min(*BestKnownTC / VF, MaxInterleaveCount);
+  }

   // If we did not calculate the cost for VF (because the user selected the VF)
   // then we calculate the cost of VF here.
   if (LoopCost == 0)
     LoopCost = expectedCost(VF).first;

   assert(LoopCost && "Non-zero loop cost expected");

llvm/test/Transforms/LoopVectorize/X86/no_fpmath_with_hotness.ll

 ; RUN: opt < %s -loop-vectorize -mtriple=x86_64-unknown-linux -S -pass-remarks=loop-vectorize -pass-remarks-missed=loop-vectorize -pass-remarks-analysis=loop-vectorize -pass-remarks-with-hotness 2>&1 | FileCheck %s
 ; RUN: opt < %s -passes=loop-vectorize -mtriple=x86_64-unknown-linux -S -pass-remarks=loop-vectorize -pass-remarks-missed=loop-vectorize -pass-remarks-analysis=loop-vectorize -pass-remarks-with-hotness 2>&1 | FileCheck %s

 ; CHECK: remark: no_fpmath.c:6:11: loop not vectorized: cannot prove it is safe to reorder floating-point operations (hotness: 300)
 ; CHECK: remark: no_fpmath.c:6:14: loop not vectorized
-; CHECK: remark: no_fpmath.c:17:14: vectorized loop (vectorization width: 2, interleaved count: 2) (hotness: 300)
+; CHECK: remark: no_fpmath.c:17:14: vectorized loop (vectorization width: 2, interleaved count: 1) (hotness: 300)

 target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
 target triple = "x86_64-apple-macosx10.10.0"

 ; Function Attrs: nounwind readonly ssp uwtable
 define double @cond_sum(i32* nocapture readonly %v, i32 %n) #0 !dbg !4 !prof !29 {
 entry:
   %cmp.7 = icmp sgt i32 %n, 0, !dbg !3

This is an archive of the discontinued LLVM Phabricator instance.

[LV] Interleaving should not exceed estimated loop trip count.
ClosedPublic
