To be able to maximize the bandwidth during vectorization, this update provides a new option vectorizer-maximize-bandwidth. When it is turned on, the vectorizer will determine the vectorization factor (VF) using the smallest instead of widest type in the loop. To avoid increasing register pressure too much, estimates of the register usage for different VFs are calculated so that we only choose a VF when its register usage doesn't exceed the number of available registers.
Diff Detail
- Repository: rL LLVM

Event Timeline
[+Arnold, Nadav, Chandler]
If I understand this correctly, this will cause us to potentially generate wider vectors than we have underlying vector registers, and I think that, generically, this makes sense. Now that our X86 shuffle handling is sane, the splitting of wide vectors and the shuffling you get from vector extends/truncates are hopefully not too bad. Other opinions?
Did you see any performance changes on the test suite?
We might need to update the register-pressure heuristic (LoopVectorizationCostModel::calculateRegisterUsage()) to understand that very-wide vectors use multiple vector registers.
I generally agree.
The key is that we should maximize the load/store bandwidth provided we have sufficient registers (see below).
> Did you see any performance changes on the test suite?
> We might need to update the register-pressure heuristic (LoopVectorizationCostModel::calculateRegisterUsage()) to understand that very-wide vectors use multiple vector registers.
Yes, I think it is going to be important to watch the register pressure heuristics.
Just updating this revision as it seems a bit stalled. I think there are a few things going on here...
- I think it would be good to first at least mostly address the problem of identifying places where we can hoist truncs to narrow the width at which we're doing operations within the vector. Without this, I think measuring the performance impact of this change will be hard -- we'll see wins that could be realized with a less register-pressure-intensive change.
- I think this needs some more high-level tests -- we should actually add a loop test case that should vectorize differently as a consequence.
- The fp64_to_uint32-cost-model.ll change seems odd -- either the update to the test or the comments in the test are wrong... Don't know which.
- I think we would need numbers on non-x86 architectures in order to be confident that the register pressure increase wasn't problematic. This might mean using a temporary debug flag to enable this until we can hear back from other backend maintainers. I don't imagine any of the backends outside of ARM, x86, and PPC have enough autovectorization users to really care, so it shouldn't be too bad.
Update the patch.
I am so sorry for being away from this patch for so long!
In the updated patch I estimated the register usage of larger VFs to ensure that it isn't too large. This should help to reduce register pressure.
> - I think this needs some more high-level tests -- we should actually add a loop test case that should vectorize differently as a consequence.
I added a test case to this patch for larger VFs. However, I found that the cost estimation of many operations is inaccurate when the VF is large, at least for X86, so the maximum VF could not be chosen due to the unreasonably large cost. I will fix those issues later and add more test cases.
> - The fp64_to_uint32-cost-model.ll change seems odd -- either the update to the test or the comments in the test are wrong... Don't know which.
As the change is guarded by an option, many updates to test cases are no longer necessary.
> - I think we would need numbers on non-x86 architectures in order to be confident that the register pressure increase wasn't problematic. This might mean using a temporary debug flag to enable this until we can hear back from other backend maintainers. I don't imagine any of the backends outside of ARM, x86, and PPC have enough autovectorization users to really care, so it shouldn't be too bad.
Following your suggestion, I have added an option in this patch. The register pressure problem is also alleviated as described above.
Have you run LLVM's test suite with this turned on? Are there any significant performance changes? [I'm happy for this to go in, given that it's disabled by default, even if there are regressions to fix, but I'd like to know where we stand].
lib/Transforms/Vectorize/LoopVectorize.cpp
- 4585 ↗ (On Diff #35321): I'd make this 8 instead of 4 (we might have 7 VF for 8-bit integers in AVX-512, for example).
Fix a test failure (a potential bug in LLVM) when the new flag is turned on by default.
I ran the regression tests and only three failed, one of which (assertion fail) is already fixed in the updated patch, and the other two are caused by larger VF and are understood.
The performance test is still running and I will give the result later.
For all 498 test cases in the test suite, the average speed-up is 6.66% with this patch. There are several significant performance changes (>100%) and most are positive. I will keep investigating those significant negative changes.
I applied this patch on top of r248957 and ran the benchmarking subset of test-suite on an AMD Jaguar 1.5 GHz + Ubuntu 14.04 test system. The baseline is -O3 -march=btver2 while the comparison run added -mllvm -vectorizer-maximize-bandwidth (data attached).
I see very little performance difference on any test: almost everything is +/- 2% which is within the noise for most tests.
Cong, I would be interested to know if you saw any large diffs on these tests on your test system or if the bigger wins/losses all occurred on the non-benchmarking tests in test-suite?
Thank you for the performance test! I think there may be two reasons why we could not observe a big performance difference in the LLVM test suite:
- There is no hotspot that includes a loop with types of different sizes (which is what this patch optimizes).
- There are some problems with the cost model in LLVM. Even when we can choose a larger VF, the cost model shows that the larger VF has a larger cost. I will deal with this issue later.
I don't have a test in my codebase that benefits from this patch, but it is quite easy to synthesize one:
const int N = 1024 * 32;
int a[N];
char b[N];
int main() {
  for (int j = 0; j < N; ++j) { // outer loop repeats the work so the run time is measurable
    for (int i = 0; i < N; ++i) {
      a[i]++;
      b[i]++;
    }
  }
}
For the code shown above, the original running time is ~0.35s and with this patch the running time is reduced to ~0.228s.
Thanks, Cong. I am confused by the above statement versus your earlier one:
"For all 498 test cases in the test suite, the average speed-up is 6.66% with this patch. There are several significant performance changes (>100%) and most are positive. I will keep investigating those significant negative changes."
Did something in the patch or external code change such that there used to be significant performance changes but now there are not?
Sorry for forgetting to clarify this. After investigating those test results, I found that many of the large numbers are due to flaky tests: many tests showing 5%+ speed-ups don't even contain any code that is vectorized differently. However, I think your test results make more sense. So do you want me to run those tests again on my machine?
Ah...that matches my experience with test-suite then. :)
I just wanted to confirm that I wasn't missing some important step with test-suite or with your patch.
I don't see any value in running it all over again. However, for future changes, it would be good to know if your patch is firing on any of those tests, and if so, is there any perf difference? If not, we should locate some new tests!
Since the change is hidden behind a flag, I think this patch is fine. But I will let someone who knows the vectorizers better provide the final approval.
Hi Cong,
Please find some comments inline.
Michael
lib/Target/X86/X86TargetTransformInfo.cpp
- 851–853 ↗ (On Diff #36275): I believe it's an independent fix from the rest of the patch. Please commit it separately.

lib/Transforms/Vectorize/LoopVectorize.cpp
- 1401 ↗ (On Diff #36275): Nitpick: redundant whitespace after \return
- 4605 ↗ (On Diff #36275): I would prefer not changing the signature of calculateRegisterUsage and instead building the array of RUs here. I think the original interface (takes a single vectorization factor, returns a register usage object for it) is more intuitive than the one operating on arrays.
- 4607 ↗ (On Diff #36275): Typo: doen't
- 4994–4995 ↗ (On Diff #36275): Please commit such formatting fixes separately if you feel that you need them.

test/Transforms/LoopVectorize/X86/vector_max_bandwidth.ll
- 1 ↗ (On Diff #36275): You'll need REQUIRES: asserts if you scan debug dumps.
Thank you for the review, Michael! Please see my inline reply.
lib/Target/X86/X86TargetTransformInfo.cpp
- 851–853 ↗ (On Diff #36275): This is a dependent fix; without it the test case will crash. But I could commit this fix ahead of this patch.

lib/Transforms/Vectorize/LoopVectorize.cpp
- 4605 ↗ (On Diff #36275): The change of the signature is in consideration of performance: the part that calculates the highest number of values that are alive at each location is shared by different VFs (actually most parts are shared across different VFs). If we always consider many VFs for a given loop, then this signature also makes sense, right?
LGTM.
After you commit, send a message to llvmdev asking people to test with the flag enabled.
lib/Transforms/Vectorize/LoopVectorize.cpp
- 5166 ↗ (On Diff #38592): Okay, that makes sense. The register will be accounted for at its source.