This is an archive of the discontinued LLVM Phabricator instance.

Differential D29269

[Polly] Use the size of the widest type of the matrix multiplication operands
ClosedPublic

Authored by gareevroman on Jan 29 2017, 11:26 PM.

Details

Reviewers

Meinersbur
grosser
jdoerfert

Commits

rG3d4eae31ea28: Use the size of the widest type of the matrix multiplication operands
rPLO294828: Use the size of the widest type of the matrix multiplication operands
rL294828: Use the size of the widest type of the matrix multiplication operands

Summary

The size of the operand type is one of the parameters required to determine the BLIS micro-kernel. If the matrix multiplication operands have several different types, we use the size of the widest one.
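As a rough illustration of the summary, the sketch below picks the widest operand size and derives how many such elements fit into one vector register. The helper names `widestElementBits` and `elementsPerVectorRegister` are hypothetical and are not part of the patch; sizes are passed in directly instead of being queried from a data layout.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: widest element size, in bits, among the operands
// A, B and C of the update C += A * B.
inline uint64_t widestElementBits(uint64_t BitsA, uint64_t BitsB,
                                  uint64_t BitsC) {
  return std::max({BitsA, BitsB, BitsC});
}

// Hypothetical helper: number of elements of the given size that fit into
// one vector register of RegisterBits bits (integer division, rounds down).
inline uint64_t elementsPerVectorRegister(uint64_t RegisterBits,
                                          uint64_t ElementBits) {
  assert(ElementBits > 0 && "element size must be non-zero");
  return RegisterBits / ElementBits;
}
```

With a 256-bit register, three float operands give 8 elements per register, while mixing a float operand with double operands gives 4, because the 64-bit type decides.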

Diff Detail

Repository: rL LLVM

Event Timeline

gareevroman created this revision. Jan 29 2017, 11:26 PM

gareevroman added a parent revision: D29244: [Polly] Isolate a set of partial tile prefixes to allow hoisting and sinking out of the unrolled innermost loops produced by the optimization of the matrix multiplication.

Undoubtedly this change makes sense.

lib/Transform/ScheduleOptimizer.cpp
907–908 ↗	(On Diff #86244)	Is there a guarantee/check somewhere that A and B are both primitive types? I wonder about the case where, e.g., A is a float but B is a larger struct, so B->getPrimitiveSizeInBits() would return zero. `getMatMulTypeSize` would still return nonzero, but we'd not get enough space in a register for elements of B.
test/ScheduleOptimizer/pattern-matching-based-opts_6.ll
18–101 ↗	(On Diff #86244)	I don't see how the elements-per-register result manifests in this schedule. Is it the number of `Stmt_for_body6` in the innermost loop for register tiling? Could you add a short comment explaining what these tests are supposed to check?

This revision is now accepted and ready to land. Jan 31 2017, 6:07 AM

Hi Michael,

thanks for the comments! I've tried to address them and also fix the issue related to the missed hard-coded type size.

Is there a guarantee/check somewhere that A and B are both primitive types? I wonder about the case where, e.g., A is a float but B is a larger struct, so B->getPrimitiveSizeInBits() would return zero. getMatMulTypeSize would still return nonzero, but we'd not get enough space in a register for elements of B.

Right. Could we use getTypeAllocSize to get sizes?

In D29269#666725, @gareevroman wrote:

Is there a guarantee/check somewhere that A and B are both primitive types? I wonder about the case where, e.g., A is a float but B is a larger struct, so B->getPrimitiveSizeInBits() would return zero. getMatMulTypeSize would still return nonzero, but we'd not get enough space in a register for elements of B.

Right. Could we use getTypeAllocSize to get sizes?

getTypeAllocSize() is wrong. E.g., for an 8-bit char it would return 64 on most 64-bit platforms (its alignment), but SSE, for example, can put 16 of them into a 128-bit xmm register. I suggest using getTypeSizeInBits().

Michael

getTypeAllocSize() is wrong. E.g., for an 8-bit char it would return 64 on most 64-bit platforms (its alignment), but SSE, for example, can put 16 of them into a 128-bit xmm register.

I've tried to reproduce it on x86-64 (the test case can be found in https://reviews.llvm.org/D29814). However, getTypeAllocSize() returns 1 for 8-bit char. Could you please advise me how to reproduce it?

I suggest using getTypeSizeInBits().

Right. We should probably use it to compute the number of elements that can be held by a vector register.

However, in case of mapping elements to L1 cache ([1], p. 11), we should probably use getTypeAllocSize(), since we rely on the location of consecutive data in memory.

This approach is implemented in the new version of the patch.

Refs.:

[1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf

gareevroman added a child revision: D29814: [Polly] Check reduction dependencies in case of the matrix multiplication optimization. Feb 10 2017, 12:32 AM

In D29269#673282, @gareevroman wrote:

getTypeAllocSize() is wrong. E.g., for an 8-bit char it would return 64 on most 64-bit platforms (its alignment), but SSE, for example, can put 16 of them into a 128-bit xmm register.

I've tried to reproduce it on x86-64 (the test case can be found in https://reviews.llvm.org/D29814). However, getTypeAllocSize() returns 1 for 8-bit char. Could you please advise me how to reproduce it?

It is a bit more complicated than I imagined. What this takes into account is "ABIAlignment", which depends on the platform. What we are looking for is a type that occupies more space per element in an array than sizeof() returns. This is possible with a struct { int i; char c; }.

An example without a struct is X86's long double type, with a TypeSize of 80 bits and an AllocSize of 128 bits.

I suggest using getTypeSizeInBits().

Right. We should probably use it to compute the number of elements that can be held by a vector register.

However, in case of mapping elements to L1 cache ([1], p. 11), we should probably use getTypeAllocSize(), since we rely on the location of consecutive data in memory.

Agreed.
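The distinction discussed in this exchange can be made concrete with a small model. This is only a sketch: `TypeModel`, `storeSizeBytes` and `allocSizeBytes` are hypothetical names, and the model assumes that the alloc size is the store size rounded up to the ABI alignment, which is my reading of how the data layout defines it and is not stated in this review. The 16-byte alignment used for the 80-bit type corresponds to the `f80:128` entry of the `target datalayout` string in the test files of this revision.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a scalar type as a data layout sees it.
struct TypeModel {
  uint64_t SizeInBits;    // number of value bits ("TypeSize")
  uint64_t ABIAlignBytes; // ABI alignment in bytes, a power of two
};

// Bytes needed to store the value bits, rounded up to whole bytes.
inline uint64_t storeSizeBytes(const TypeModel &T) {
  return (T.SizeInBits + 7) / 8;
}

// Assumed definition of the alloc size: the store size rounded up to the
// ABI alignment, i.e. the distance between consecutive array elements.
inline uint64_t allocSizeBytes(const TypeModel &T) {
  uint64_t Store = storeSizeBytes(T);
  return (Store + T.ABIAlignBytes - 1) / T.ABIAlignBytes * T.ABIAlignBytes;
}
```

Under this model an 8-bit integer with 1-byte alignment has an alloc size of 1 byte, matching what was observed on x86-64, while the 80-bit floating-point type has a type size of 80 bits but occupies 16 bytes (128 bits) per array element. That is why the bit size suits the register computation and the alloc size suits the cache computation.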

In D29269#673632, @Meinersbur wrote:

In D29269#673282, @gareevroman wrote:

getTypeAllocSize() is wrong. E.g., for an 8-bit char it would return 64 on most 64-bit platforms (its alignment), but SSE, for example, can put 16 of them into a 128-bit xmm register.

I've tried to reproduce it on x86-64 (the test case can be found in https://reviews.llvm.org/D29814). However, getTypeAllocSize() returns 1 for 8-bit char. Could you please advise me how to reproduce it?

It is a bit more complicated than I imagined. What this takes into account is "ABIAlignment", which depends on the platform. What we are looking for is a type that occupies more space per element in an array than sizeof() returns. This is possible with a struct { int i; char c; }.

An example without a struct is X86's long double type, with a TypeSize of 80 bits and an AllocSize of 128 bits.

OK. Thanks.

Closed by commit rL294828: Use the size of the widest type of the matrix multiplication operands (authored by romangareev). Feb 10 2017, 11:11 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                                   Size
polly/trunk/lib/Transform/ScheduleOptimizer.cpp                        56 lines
polly/trunk/test/ScheduleOptimizer/pattern-matching-based-opts_7.ll    156 lines
polly/trunk/test/ScheduleOptimizer/pattern-matching-based-opts_8.ll    126 lines

Diff 88084

polly/trunk/lib/Transform/ScheduleOptimizer.cpp

@@ (894 lines hidden) @@ __isl_give isl_schedule_node *ScheduleTreeOptimizer::createMacroKernel(
   TileSizes[DimOutNum - 1] = MacroKernelParams.Kc;
   Node = tileNode(Node, "1st level tiling", TileSizes, 1);
   Node = isl_schedule_node_parent(isl_schedule_node_parent(Node));
   Node = permuteBandNodeDimensions(Node, DimOutNum - 2, DimOutNum - 1);
   Node = permuteBandNodeDimensions(Node, DimOutNum - 3, DimOutNum - 1);
   return isl_schedule_node_child(isl_schedule_node_child(Node, 0), 0);
 }
 
+/// Get the size of the widest type of the matrix multiplication operands
+/// in bytes, including alignment padding.
+///
+/// @param MMI Parameters of the matrix multiplication operands.
+/// @return The size of the widest type of the matrix multiplication operands
+/// in bytes, including alignment padding.
+static uint64_t getMatMulAlignTypeSize(MatMulInfoTy MMI) {
+  auto *S = MMI.A->getStatement()->getParent();
+  auto &DL = S->getFunction().getParent()->getDataLayout();
+  auto ElementSizeA = DL.getTypeAllocSize(MMI.A->getElementType());
+  auto ElementSizeB = DL.getTypeAllocSize(MMI.B->getElementType());
+  auto ElementSizeC = DL.getTypeAllocSize(MMI.WriteToC->getElementType());
+  return std::max({ElementSizeA, ElementSizeB, ElementSizeC});
+}
+
+/// Get the size of the widest type of the matrix multiplication operands
+/// in bits.
+///
+/// @param MMI Parameters of the matrix multiplication operands.
+/// @return The size of the widest type of the matrix multiplication operands
+/// in bits.
+static uint64_t getMatMulTypeSize(MatMulInfoTy MMI) {
+  auto *S = MMI.A->getStatement()->getParent();
+  auto &DL = S->getFunction().getParent()->getDataLayout();
+  auto ElementSizeA = DL.getTypeSizeInBits(MMI.A->getElementType());
+  auto ElementSizeB = DL.getTypeSizeInBits(MMI.B->getElementType());
+  auto ElementSizeC = DL.getTypeSizeInBits(MMI.WriteToC->getElementType());
+  return std::max({ElementSizeA, ElementSizeB, ElementSizeC});
+}
+
 /// Get parameters of the BLIS micro kernel.
 ///
 /// We choose the Mr and Nr parameters of the micro kernel to be large enough
 /// such that no stalls caused by the combination of latencies and dependencies
 /// are introduced during the updates of the resulting matrix of the matrix
 /// multiplication. However, they should also be as small as possible to
 /// release more registers for entries of multiplied matrices.
 ///
 /// @param TTI Target Transform Info.
+/// @param MMI Parameters of the matrix multiplication operands.
 /// @return The structure of type MicroKernelParamsTy.
 /// @see MicroKernelParamsTy
 static struct MicroKernelParamsTy
-getMicroKernelParams(const llvm::TargetTransformInfo *TTI) {
+getMicroKernelParams(const llvm::TargetTransformInfo *TTI, MatMulInfoTy MMI) {
   assert(TTI && "The target transform info should be provided.");
 
   // Nvec - Number of double-precision floating-point numbers that can be hold
   // by a vector register. Use 2 by default.
   long RegisterBitwidth = VectorRegisterBitwidth;
 
   if (RegisterBitwidth == -1)
     RegisterBitwidth = TTI->getRegisterBitWidth(true);
-  auto Nvec = RegisterBitwidth / 64;
+  auto ElementSize = getMatMulTypeSize(MMI);
+  assert(ElementSize > 0 && "The element size of the matrix multiplication "
+                            "operands should be greater than zero.");
+  auto Nvec = RegisterBitwidth / ElementSize;
   if (Nvec == 0)
     Nvec = 2;
   int Nr =
       ceil(sqrt(Nvec * LatencyVectorFma * ThroughputVectorFma) / Nvec) * Nvec;
   int Mr = ceil(Nvec * LatencyVectorFma * ThroughputVectorFma / Nr);
   return {Mr, Nr};
 }
 
 /// Get parameters of the BLIS macro kernel.
 ///
 /// During the computation of matrix multiplication, blocks of partitioned
 /// matrices are mapped to different layers of the memory hierarchy.
 /// To optimize data reuse, blocks should be ideally kept in cache between
 /// iterations. Since parameters of the macro kernel determine sizes of these
 /// blocks, there are upper and lower bounds on these parameters.
 ///
 /// @param MicroKernelParams Parameters of the micro-kernel
 /// to be taken into account.
+/// @param MMI Parameters of the matrix multiplication operands.
 /// @return The structure of type MacroKernelParamsTy.
 /// @see MacroKernelParamsTy
 /// @see MicroKernelParamsTy
 static struct MacroKernelParamsTy
-getMacroKernelParams(const MicroKernelParamsTy &MicroKernelParams) {
+getMacroKernelParams(const MicroKernelParamsTy &MicroKernelParams,
+                     MatMulInfoTy MMI) {
   // According to www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf,
   // it requires information about the first two levels of a cache to determine
   // all the parameters of a macro-kernel. It also checks that an associativity
   // degree of a cache level is greater than two. Otherwise, another algorithm
   // for determination of the parameters should be used.
   if (!(MicroKernelParams.Mr > 0 && MicroKernelParams.Nr > 0 &&
         FirstCacheLevelSize > 0 && SecondCacheLevelSize > 0 &&
         FirstCacheLevelAssociativity > 2 && SecondCacheLevelAssociativity > 2))
     return {1, 1, 1};
   // The quotient should be greater than zero.
   if (PollyPatternMatchingNcQuotient <= 0)
     return {1, 1, 1};
   int Car = floor(
       (FirstCacheLevelAssociativity - 1) /
       (1 + static_cast<double>(MicroKernelParams.Nr) / MicroKernelParams.Mr));
+  auto ElementSize = getMatMulAlignTypeSize(MMI);
+  assert(ElementSize > 0 && "The element size of the matrix multiplication "
+                            "operands should be greater than zero.");
   int Kc = (Car * FirstCacheLevelSize) /
-           (MicroKernelParams.Mr * FirstCacheLevelAssociativity * 8);
+           (MicroKernelParams.Mr * FirstCacheLevelAssociativity * ElementSize);
-  double Cac = static_cast<double>(Kc * 8 * SecondCacheLevelAssociativity) /
+  double Cac =
+      static_cast<double>(Kc * ElementSize * SecondCacheLevelAssociativity) /
       SecondCacheLevelSize;
   int Mc = floor((SecondCacheLevelAssociativity - 2) / Cac);
   int Nc = PollyPatternMatchingNcQuotient * MicroKernelParams.Nr;
   return {Mc, Nc, Kc};
 }
 
 /// Create an access relation that is specific to
 /// the matrix multiplication pattern.
 ///
@@ (218 lines hidden) @@ assert(DimOutNum > 2 && "In case of the matrix multiplication the loop nest "
                           "and, consequently, the corresponding scheduling "
                           "functions have at least three dimensions.");
   Node = permuteBandNodeDimensions(Node, MMI.i, DimOutNum - 3);
   int NewJ = MMI.j == DimOutNum - 3 ? MMI.i : MMI.j;
   int NewK = MMI.k == DimOutNum - 3 ? MMI.i : MMI.k;
   Node = permuteBandNodeDimensions(Node, NewJ, DimOutNum - 2);
   NewK = MMI.k == DimOutNum - 2 ? MMI.j : MMI.k;
   Node = permuteBandNodeDimensions(Node, NewK, DimOutNum - 1);
-  auto MicroKernelParams = getMicroKernelParams(TTI);
+  auto MicroKernelParams = getMicroKernelParams(TTI, MMI);
-  auto MacroKernelParams = getMacroKernelParams(MicroKernelParams);
+  auto MacroKernelParams = getMacroKernelParams(MicroKernelParams, MMI);
   Node = createMacroKernel(Node, MacroKernelParams);
   Node = createMicroKernel(Node, MicroKernelParams);
   if (MacroKernelParams.Mc == 1 || MacroKernelParams.Nc == 1 ||
       MacroKernelParams.Kc == 1)
     return Node;
   auto *MapOldIndVar = getInductionVariablesSubstitution(
       Node, MicroKernelParams, MacroKernelParams);
   if (!MapOldIndVar)
@@ (311 lines hidden) @@

polly/trunk/test/ScheduleOptimizer/pattern-matching-based-opts_7.ll

				; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
				; RUN: -polly-target-throughput-vector-fma=1 \
				; RUN: -polly-target-latency-vector-fma=8 \
				; RUN: -analyze -polly-ast -polly-target-1st-cache-level-associativity=8 \
				; RUN: -polly-target-2nd-cache-level-associativity=8 \
				; RUN: -polly-target-1st-cache-level-size=32768 \
				; RUN: -polly-target-vector-register-bitwidth=256 \
				; RUN: -polly-target-2nd-cache-level-size=262144 < %s \
				; RUN: | FileCheck %s
				;
				; /* C := A * B + C */
				; /* Elements of the matrices A, B, C have the float type. */
				; /* The type size of elements of the matrix multiplication operands is used
				; to determine the parameters of the code produced by the optimization
				; of the matrix multiplication (e.g. bounds of the loops of the loop
				; nest, the innermost loop body). This test checks the form of
				; the generated loop nest. See getMicroKernelParams and
				; getMacroKernelParams from lib/Transform/ScheduleOptimizer.cpp
				; for details. */
				; for (i = 0; i < _PB_NI; i++)
				; for (j = 0; j < _PB_NJ; j++)
				; for (k = 0; k < _PB_NK; ++k)
				; C[i][j] += A[i][k] * B[k][j];
				;
				; CHECK: // 1st level tiling - Tiles
				; CHECK-NEXT: for (int c1 = 0; c1 <= 2; c1 += 1) {
				; CHECK-NEXT: for (int c3 = 0; c3 <= 1023; c3 += 1)
				; CHECK-NEXT: for (int c4 = 384 * c1; c4 <= min(1023, 384 * c1 + 383); c4 += 1)
				; CHECK-NEXT: CopyStmt_0(0, c3, c4);
				; CHECK-NEXT: for (int c2 = 0; c2 <= 7; c2 += 1) {
				; CHECK-NEXT: for (int c3 = 128 * c2; c3 <= 128 * c2 + 127; c3 += 1)
				; CHECK-NEXT: for (int c5 = 384 * c1; c5 <= min(1023, 384 * c1 + 383); c5 += 1)
				; CHECK-NEXT: CopyStmt_1(c3, 0, c5);
				; CHECK-NEXT: // 1st level tiling - Points
				; CHECK-NEXT: // Register tiling - Tiles
				; CHECK-NEXT: for (int c3 = 0; c3 <= 127; c3 += 1)
				; CHECK-NEXT: for (int c4 = 0; c4 <= 15; c4 += 1)
				; CHECK-NEXT: for (int c5 = 0; c5 <= min(383, -384 * c1 + 1023); c5 += 1) {
				; CHECK-NEXT: // Register tiling - Points
				; CHECK-NEXT: {
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4, 8 * c3, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4, 8 * c3 + 1, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4, 8 * c3 + 2, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4, 8 * c3 + 3, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4, 8 * c3 + 4, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4, 8 * c3 + 5, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4, 8 * c3 + 6, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4, 8 * c3 + 7, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 1, 8 * c3, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 1, 8 * c3 + 1, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 1, 8 * c3 + 2, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 1, 8 * c3 + 3, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 1, 8 * c3 + 4, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 1, 8 * c3 + 5, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 1, 8 * c3 + 6, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 1, 8 * c3 + 7, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 2, 8 * c3, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 2, 8 * c3 + 1, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 2, 8 * c3 + 2, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 2, 8 * c3 + 3, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 2, 8 * c3 + 4, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 2, 8 * c3 + 5, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 2, 8 * c3 + 6, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 2, 8 * c3 + 7, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 3, 8 * c3, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 3, 8 * c3 + 1, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 3, 8 * c3 + 2, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 3, 8 * c3 + 3, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 3, 8 * c3 + 4, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 3, 8 * c3 + 5, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 3, 8 * c3 + 6, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 3, 8 * c3 + 7, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 4, 8 * c3, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 4, 8 * c3 + 1, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 4, 8 * c3 + 2, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 4, 8 * c3 + 3, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 4, 8 * c3 + 4, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 4, 8 * c3 + 5, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 4, 8 * c3 + 6, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 4, 8 * c3 + 7, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 5, 8 * c3, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 5, 8 * c3 + 1, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 5, 8 * c3 + 2, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 5, 8 * c3 + 3, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 5, 8 * c3 + 4, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 5, 8 * c3 + 5, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 5, 8 * c3 + 6, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 5, 8 * c3 + 7, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 6, 8 * c3, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 6, 8 * c3 + 1, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 6, 8 * c3 + 2, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 6, 8 * c3 + 3, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 6, 8 * c3 + 4, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 6, 8 * c3 + 5, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 6, 8 * c3 + 6, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 6, 8 * c3 + 7, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 7, 8 * c3, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 7, 8 * c3 + 1, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 7, 8 * c3 + 2, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 7, 8 * c3 + 3, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 7, 8 * c3 + 4, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 7, 8 * c3 + 5, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 7, 8 * c3 + 6, 384 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(128 * c2 + 8 * c4 + 7, 8 * c3 + 7, 384 * c1 + c5);
				; CHECK-NEXT: }
				; CHECK-NEXT: }
				; CHECK-NEXT: }
				; CHECK-NEXT: }
				;
				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-unknown"

				; Function Attrs: noinline nounwind uwtable
				define internal void @kernel_gemm(i32 %ni, i32 %nj, i32 %nk, float %alpha, float %beta, [1024 x float]* %C, [1024 x float]* %A, [1024 x float]* %B) #0 {
				entry:
				br label %entry.split

				entry.split: ; preds = %entry
				br label %for.cond1.preheader

				for.cond1.preheader: ; preds = %for.inc20, %entry.split
				%indvars.iv41 = phi i64 [ 0, %entry.split ], [ %indvars.iv.next42, %for.inc20 ]
				br label %for.cond4.preheader

				for.cond4.preheader: ; preds = %for.inc17, %for.cond1.preheader
				%indvars.iv38 = phi i64 [ 0, %for.cond1.preheader ], [ %indvars.iv.next39, %for.inc17 ]
				br label %for.body6

				for.body6: ; preds = %for.body6, %for.cond4.preheader
				%indvars.iv = phi i64 [ 0, %for.cond4.preheader ], [ %indvars.iv.next, %for.body6 ]
				%arrayidx8 = getelementptr inbounds [1024 x float], [1024 x float]* %A, i64 %indvars.iv41, i64 %indvars.iv
				%tmp = load float, float* %arrayidx8, align 4
				%arrayidx12 = getelementptr inbounds [1024 x float], [1024 x float]* %B, i64 %indvars.iv, i64 %indvars.iv38
				%tmp1 = load float, float* %arrayidx12, align 4
				%mul = fmul float %tmp, %tmp1
				%arrayidx16 = getelementptr inbounds [1024 x float], [1024 x float]* %C, i64 %indvars.iv41, i64 %indvars.iv38
				%tmp2 = load float, float* %arrayidx16, align 4
				%add = fadd float %tmp2, %mul
				store float %add, float* %arrayidx16, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp ne i64 %indvars.iv.next, 1024
				br i1 %exitcond, label %for.body6, label %for.inc17

				for.inc17: ; preds = %for.body6
				%indvars.iv.next39 = add nuw nsw i64 %indvars.iv38, 1
				%exitcond40 = icmp ne i64 %indvars.iv.next39, 1024
				br i1 %exitcond40, label %for.cond4.preheader, label %for.inc20

				for.inc20: ; preds = %for.inc17
				%indvars.iv.next42 = add nuw nsw i64 %indvars.iv41, 1
				%exitcond43 = icmp ne i64 %indvars.iv.next42, 1024
				br i1 %exitcond43, label %for.cond1.preheader, label %for.end22

				for.end22: ; preds = %for.inc20
				ret void
				}
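To relate the CHECK lines of this test to the element size, the following sketch re-implements the micro-kernel arithmetic from the ScheduleOptimizer.cpp diff as a standalone function. The function name `microKernelParams` and the struct `MicroKernelParams` are hypothetical; the inputs stand in for the command-line options used in the RUN lines (register bit width 256, FMA latency 8, FMA throughput 1), and the element size is passed in directly instead of being queried from the data layout.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical stand-in for the Mr/Nr pair computed by the optimizer.
struct MicroKernelParams {
  int Mr;
  int Nr;
};

// Follows the arithmetic shown in the diff: Nvec elements fit into one
// vector register, and Mr x Nr is chosen to cover the FMA latency.
inline MicroKernelParams microKernelParams(long RegisterBits,
                                           uint64_t ElementBits,
                                           int LatencyVectorFma,
                                           int ThroughputVectorFma) {
  assert(ElementBits > 0 && "element size must be non-zero");
  long Nvec = RegisterBits / static_cast<long>(ElementBits);
  if (Nvec == 0)
    Nvec = 2; // same fallback as in the diff
  long Work = Nvec * LatencyVectorFma * ThroughputVectorFma;
  int Nr = static_cast<int>(
      std::ceil(std::sqrt(static_cast<double>(Work)) / Nvec) * Nvec);
  // Integer division before ceil, as in the original expression.
  int Mr = static_cast<int>(std::ceil(static_cast<double>(Work / Nr)));
  return {Mr, Nr};
}
```

For 32-bit elements this yields Mr = 8 and Nr = 8, which is consistent with the 64 unrolled `Stmt_for_body6` instances (offsets 0 to 7 in each of the first two dimensions) in the register-tiling point loop above. With a 64-bit widest element the same inputs would give Mr = 4 and Nr = 8.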

polly/trunk/test/ScheduleOptimizer/pattern-matching-based-opts_8.ll

				; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
				; RUN: -polly-target-throughput-vector-fma=1 \
				; RUN: -polly-target-latency-vector-fma=8 \
				; RUN: -analyze -polly-ast -polly-target-1st-cache-level-associativity=8 \
				; RUN: -polly-target-2nd-cache-level-associativity=8 \
				; RUN: -polly-target-1st-cache-level-size=32768 \
				; RUN: -polly-target-vector-register-bitwidth=256 \
				; RUN: -polly-target-2nd-cache-level-size=262144 < %s \
				; RUN: | FileCheck %s
				;
				; /* C := A * B + C */
				; /* Elements of the matrices B, C have the double type. */
				; /* Elements of the matrix A have the float type. */
				; /* The type size of elements of the matrix multiplication operands is used
				; to determine the parameters of the code produced by the optimization
				; of the matrix multiplication (e.g. bounds of the loops of the loop
				; nest, the innermost loop body). This test checks the form of
				; the generated loop nest. See getMicroKernelParams and
				; getMacroKernelParams from lib/Transform/ScheduleOptimizer.cpp
				; for details. */
				; for (i = 0; i < _PB_NI; i++)
				; for (j = 0; j < _PB_NJ; j++)
				; for (k = 0; k < _PB_NK; ++k)
				; C[i][j] += A[i][k] * B[k][j];
				;
				; CHECK: // 1st level tiling - Tiles
				; CHECK-NEXT: for (int c1 = 0; c1 <= 3; c1 += 1) {
				; CHECK-NEXT: for (int c3 = 0; c3 <= 1023; c3 += 1)
				; CHECK-NEXT: for (int c4 = 256 * c1; c4 <= 256 * c1 + 255; c4 += 1)
				; CHECK-NEXT: CopyStmt_0(0, c3, c4);
				; CHECK-NEXT: for (int c2 = 0; c2 <= 10; c2 += 1) {
				; CHECK-NEXT: for (int c3 = 96 * c2; c3 <= min(1023, 96 * c2 + 95); c3 += 1)
				; CHECK-NEXT: for (int c5 = 256 * c1; c5 <= 256 * c1 + 255; c5 += 1)
				; CHECK-NEXT: CopyStmt_1(c3, 0, c5);
				; CHECK-NEXT: // 1st level tiling - Points
				; CHECK-NEXT: // Register tiling - Tiles
				; CHECK-NEXT: for (int c3 = 0; c3 <= 127; c3 += 1)
				; CHECK-NEXT: for (int c4 = 0; c4 <= min(23, -24 * c2 + 255); c4 += 1)
				; CHECK-NEXT: for (int c5 = 0; c5 <= 255; c5 += 1) {
				; CHECK-NEXT: // Register tiling - Points
				; CHECK-NEXT: {
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4, 8 * c3, 256 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4, 8 * c3 + 1, 256 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4, 8 * c3 + 2, 256 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4, 8 * c3 + 3, 256 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4, 8 * c3 + 4, 256 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4, 8 * c3 + 5, 256 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4, 8 * c3 + 6, 256 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4, 8 * c3 + 7, 256 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4 + 1, 8 * c3, 256 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4 + 1, 8 * c3 + 1, 256 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4 + 1, 8 * c3 + 2, 256 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4 + 1, 8 * c3 + 3, 256 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4 + 1, 8 * c3 + 4, 256 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4 + 1, 8 * c3 + 5, 256 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4 + 1, 8 * c3 + 6, 256 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4 + 1, 8 * c3 + 7, 256 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4 + 2, 8 * c3, 256 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4 + 2, 8 * c3 + 1, 256 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4 + 2, 8 * c3 + 2, 256 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4 + 2, 8 * c3 + 3, 256 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4 + 2, 8 * c3 + 4, 256 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4 + 2, 8 * c3 + 5, 256 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4 + 2, 8 * c3 + 6, 256 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4 + 2, 8 * c3 + 7, 256 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4 + 3, 8 * c3, 256 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4 + 3, 8 * c3 + 1, 256 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4 + 3, 8 * c3 + 2, 256 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4 + 3, 8 * c3 + 3, 256 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4 + 3, 8 * c3 + 4, 256 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4 + 3, 8 * c3 + 5, 256 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4 + 3, 8 * c3 + 6, 256 * c1 + c5);
				; CHECK-NEXT: Stmt_for_body6(96 * c2 + 4 * c4 + 3, 8 * c3 + 7, 256 * c1 + c5);
				; CHECK-NEXT: }
				; CHECK-NEXT: }
				; CHECK-NEXT: }
				; CHECK-NEXT: }
				;
				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-unknown"

				; Function Attrs: noinline nounwind uwtable
				define internal void @kernel_gemm(i32 %ni, i32 %nj, i32 %nk, double %alpha, double %beta, [1024 x double]* %C, [1024 x float]* %A, [1024 x double]* %B) #0 {
				entry:
				br label %entry.split

				entry.split: ; preds = %entry
				br label %for.cond1.preheader

				for.cond1.preheader: ; preds = %for.inc20, %entry.split
				%indvars.iv41 = phi i64 [ 0, %entry.split ], [ %indvars.iv.next42, %for.inc20 ]
				br label %for.cond4.preheader

				for.cond4.preheader: ; preds = %for.inc17, %for.cond1.preheader
				%indvars.iv38 = phi i64 [ 0, %for.cond1.preheader ], [ %indvars.iv.next39, %for.inc17 ]
				br label %for.body6

				for.body6: ; preds = %for.body6, %for.cond4.preheader
				%indvars.iv = phi i64 [ 0, %for.cond4.preheader ], [ %indvars.iv.next, %for.body6 ]
				%arrayidx8 = getelementptr inbounds [1024 x float], [1024 x float]* %A, i64 %indvars.iv41, i64 %indvars.iv
				%tmp = load float, float* %arrayidx8, align 4
				%conv = fpext float %tmp to double
				%arrayidx12 = getelementptr inbounds [1024 x double], [1024 x double]* %B, i64 %indvars.iv, i64 %indvars.iv38
				%tmp1 = load double, double* %arrayidx12, align 8
				%mul = fmul double %conv, %tmp1
				%arrayidx16 = getelementptr inbounds [1024 x double], [1024 x double]* %C, i64 %indvars.iv41, i64 %indvars.iv38
				%tmp2 = load double, double* %arrayidx16, align 8
				%add = fadd double %tmp2, %mul
				store double %add, double* %arrayidx16, align 8
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp ne i64 %indvars.iv.next, 1024
				br i1 %exitcond, label %for.body6, label %for.inc17

				for.inc17: ; preds = %for.body6
				%indvars.iv.next39 = add nuw nsw i64 %indvars.iv38, 1
				%exitcond40 = icmp ne i64 %indvars.iv.next39, 1024
				br i1 %exitcond40, label %for.cond4.preheader, label %for.inc20

				for.inc20: ; preds = %for.inc17
				%indvars.iv.next42 = add nuw nsw i64 %indvars.iv41, 1
				%exitcond43 = icmp ne i64 %indvars.iv.next42, 1024
				br i1 %exitcond43, label %for.cond1.preheader, label %for.end22

				for.end22: ; preds = %for.inc20
				ret void
				}
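The tile sizes that appear in the CHECK lines of the two tests (384 and 128 for all-float operands, 256 and 96 when the widest operand is a double) can be reproduced from the macro-kernel formulas in the ScheduleOptimizer.cpp diff. The sketch below is a standalone approximation: `macroKernelSizes` and `MacroKernelSizes` are hypothetical names, the cache parameters are taken from the RUN lines, the micro-kernel values Mr and Nr are assumed inputs read off the unrolled statement pattern, and the guard conditions of the original function (associativity greater than two, positive Nc quotient) are omitted, as is Nc itself.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical stand-in holding only the two sizes computed here.
struct MacroKernelSizes {
  int Mc;
  int Kc;
};

// Follows the Car/Kc/Cac/Mc arithmetic shown in the diff. ElementBytes is
// the alloc size of the widest operand type in bytes. Assumes the inputs
// lead to Kc > 0, otherwise Cac would be zero.
inline MacroKernelSizes macroKernelSizes(int Mr, int Nr, uint64_t ElementBytes,
                                         int L1Size, int L1Assoc, int L2Size,
                                         int L2Assoc) {
  assert(Mr > 0 && Nr > 0 && ElementBytes > 0);
  int Car = static_cast<int>(
      std::floor((L1Assoc - 1) / (1 + static_cast<double>(Nr) / Mr)));
  int Kc = static_cast<int>((static_cast<uint64_t>(Car) * L1Size) /
                            (static_cast<uint64_t>(Mr) * L1Assoc * ElementBytes));
  double Cac = static_cast<double>(Kc * ElementBytes * L2Assoc) / L2Size;
  int Mc = static_cast<int>(std::floor((L2Assoc - 2) / Cac));
  return {Mc, Kc};
}
```

With a 32 KiB 8-way first-level cache and a 256 KiB 8-way second-level cache, 4-byte elements with Mr = Nr = 8 give Kc = 384 and Mc = 128, and 8-byte elements with Mr = 4, Nr = 8 give Kc = 256 and Mc = 96. Before the patch, the hard-coded 8 in these formulas would have been applied to the all-float case as well.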