This is an archive of the discontinued LLVM Phabricator instance.

[GSoC 2016] [Polly] [WIP] Apply all necessary tilings and unrollings to get a micro-kernel
Closed, Public

Authored by gareevroman on Jun 8 2016, 8:53 AM.

Details

Reviewers

Meinersbur
grosser
jdoerfert

Commits

rG42402c9e89e4: Apply all necessary tilings and unrollings to get a micro-kernel
rPLO273397: Apply all necessary tilings and unrollings to get a micro-kernel
rL273397: Apply all necessary tilings and unrollings to get a micro-kernel

Summary

This is the first patch to apply the BLIS matmul optimization pattern on matmul kernels (http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf). BLIS implements gemm as three nested loops around a macro-kernel, plus two packing routines. The macro-kernel is implemented in terms of two additional loops around a micro-kernel. The micro-kernel is a loop around a rank-1 (i.e., outer product) update. In this change we create the BLIS micro-kernel by applying a combination of tiling and unrolling. In subsequent changes we will add the extraction of the BLIS macro-kernel and implement the packing transformation.
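To make the term "micro-kernel" concrete, here is a minimal C++ sketch of a loop around a rank-1 update on an Mr x Nr tile of C. It is an illustration only, not code from this patch: the function name `microKernel`, its parameters, and the row-major layout are assumptions made for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical micro-kernel (not part of the patch): adds the product of an
// Mr x K panel of A and a K x Nr panel of B to an Mr x Nr tile of C.
// All matrices are assumed to be row-major with leading dimensions
// lda, ldb and ldc.
template <std::size_t Mr, std::size_t Nr>
void microKernel(std::size_t K, const double *A, std::size_t lda,
                 const double *B, std::size_t ldb, double *C,
                 std::size_t ldc) {
  assert(A && B && C && "all operands must be provided");
  // Mr * Nr accumulators; for small Mr and Nr a compiler can keep them in
  // vector registers.
  std::array<double, Mr * Nr> Acc{};
  for (std::size_t k = 0; k < K; ++k)        // loop around the rank-1 update
    for (std::size_t i = 0; i < Mr; ++i)     // constant bounds, can be unrolled
      for (std::size_t j = 0; j < Nr; ++j)   // constant bounds, can be unrolled
        Acc[i * Nr + j] += A[i * lda + k] * B[k * ldb + j];
  // Write the accumulated tile back to C.
  for (std::size_t i = 0; i < Mr; ++i)
    for (std::size_t j = 0; j < Nr; ++j)
      C[i * ldc + j] += Acc[i * Nr + j];
}
```

Each iteration of the k loop multiplies one column slice of A with one row slice of B, which is the outer product mentioned in the summary. The two inner loops have constant trip counts, which is what tiling followed by full unrolling of the point loops produces.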

Diff Detail

Event Timeline

I haven't yet found out how to get the throughput of vector instructions per clock cycle and the latency of instructions. That's why these values are passed as command line parameters. Maybe TargetTransformInfo::getArithmeticInstrCost can be used for this purpose. However, I haven't found an algorithm that is used by target architectures to compute a cost in TargetTransformInfoImpl.

Hi Roman,

the patch looks generally good. However, I have a couple of smaller comments.

Some comments to the commit message:

Does this already give some speedups for your kernel? In case it does, can you state the improvement in the commit message?
You already give an overview of what BLIS does in the commit message. Can you explicitly state which part this patch implements (you do this partially) and what the subsequent steps are? Something like:

This is the first patch to apply the BLIS matmul optimization pattern on matmul kernels (URL/reference). BLIS implements gemm as three nested loops around a macro-kernel, plus two packing routines. The macro-kernel is implemented in terms of two additional loops around a micro-kernel. The micro-kernel is a loop around a rank-1 (i.e., outer product) update. In this change we create the BLIS micro-kernel by applying a combination of tiling and unrolling. In subsequent changes we will .....

include/polly/ScheduleOptimizer.h
106	The first time I read this function name I got the impression you want to register a tile node. Maybe a name such as applyRegisterTiling() could help to avoid such issues.
109	that is used where?
111	Why Node is repeated here?
114	I would personally call this 'optimizeMatMulPattern'.
lib/Transform/ScheduleOptimizer.cpp
124	The option names are very cryptic. Can you spell them out to make them more understandable. Also, maybe add a prefix such as -polly-target-latency-vector-fma? And below -polly-target-througput-vector-fma (if these are the correct names)? Also, please use more descriptive variable names.
125	The minimal number of cycles between issuing two dependent consecutive vector fused multiply-add instructions. Also, you repeat "instructions" twice here.
134	The second part does not seem grammatically correct.
362	Outlining this function is a preparatory transformation. I would suggest committing it separately as an NFC cleanup in preparation for this overall commit. No additional pre-commit review is needed for such a change.
404	Nice cleanup.
510	This clearly would benefit from a longer comment at the top of this function definition that describes what we are doing here, where we got the ideas from and where the cost functions are derived from.

Hi Tobias,

thank you for the comments! I’ve tried to address them in this version of the patch.

P.S.: I haven’t got the results of the nightly tests yet. However, I get the following numbers if I compile gemm:

clang -O3 gemm.c -I utilities/ utilities/polybench.c -DPOLYBENCH_TIME -DPOLYBENCH_USE_SCALAR_LB -march=native -Xclang -load -Xclang /tmp_home/compiled/llvm_d/lib/LLVMPolly.so -mllvm -polly -mllvm -polly-pattern-matching-based-opts=false -mllvm -polly-target-latency-vector-fma=8 -mllvm -polly-target-througput-vector-fma=1

0.750034

clang -O3 gemm.c -I utilities/ utilities/polybench.c -DPOLYBENCH_TIME -DPOLYBENCH_USE_SCALAR_LB -march=native -Xclang -load -Xclang /tmp_home/compiled/llvm_d/lib/LLVMPolly.so -mllvm -polly -mllvm -polly-pattern-matching-based-opts=true -mllvm -polly-target-latency-vector-fma=8 -mllvm -polly-target-througput-vector-fma=1

0.236387
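Assuming the two numbers above are the execution times in seconds that PolyBench prints under -DPOLYBENCH_TIME, the relative improvement on this machine works out as:

```latex
\[
\text{speedup} = \frac{0.750034}{0.236387} \approx 3.17
\]
```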

LGTM.

lib/Transform/ScheduleOptimizer.cpp
134	The throughput of ...

This revision is now accepted and ready to land. Jun 12 2016, 11:03 AM

Hi Tobias,

in the case of perf-x86_64-penryn-O3-polly (http://gcc12.fsffrance.org:8011/builders/perf-x86_64-penryn-O3-polly) we don't get speedups for the matmul kernels:

Performance Regressions - Compile Time	Δ	Previous	Current	σ
MultiSource/Benchmarks/VersaBench/bmm/bmm	32.14%	0.2240	0.2960	0.0060
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/2mm	21.87%	0.5120	0.6240	0.0048
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/3mm/3mm	20.29%	0.8280	0.9960	0.0062
SingleSource/Benchmarks/Polybench/datamining/covariance/covariance	14.05%	0.4840	0.5520	0.0048
SingleSource/Benchmarks/Polybench/datamining/correlation/correlation	14.04%	0.4560	0.5200	0.0053
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gemm/gemm	12.50%	0.2880	0.3240	0.0070

Performance Regressions - Execution Time	Δ	Previous	Current	σ
SingleSource/Benchmarks/Polybench/datamining/correlation/correlation	38.42%	0.7600	1.0520	0.0047
SingleSource/Benchmarks/Polybench/datamining/covariance/covariance	34.21%	0.7600	1.0200	0.0068
MultiSource/Benchmarks/SciMark2-C/scimark2	2.48%	63.1080	64.6720	0.0120

Performance Improvements - Execution Time	Δ	Previous	Current	σ
MultiSource/Benchmarks/VersaBench/bmm/bmm	-18.69%	2.2040	1.7920	0.0150

Maybe we should try to specify values of LatencyVectorFma and ThrougputVectorFma. Could you please advise me where I can find parameters of perf-x86_64-penryn-O3-polly? I think that the model name of its processor would be enough to determine LatencyVectorFma and ThrougputVectorFma.
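To show how the two parameters feed into the micro-kernel shape, here is a minimal C++ sketch that follows the formula used in optimizeMatMulPattern in this patch. The helper name `computeMicroKernelParams` and the struct `MicroKernelParams` are hypothetical. Unlike the patch, the sketch does the division in floating point before rounding up; that is an assumption made here so that `ceil` has a fractional value to round. The inputs in the usage note are illustrative, not measured values for any particular processor.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical result type for this example (not part of the patch).
struct MicroKernelParams {
  int Mr; // register tile size in the i dimension
  int Nr; // register tile size in the j dimension
};

// Nvec: number of doubles that fit into one vector register.
// LatencyVectorFma: cycles between two dependent vector FMA instructions.
// ThroughputVectorFma: vector FMA instructions issued per clock cycle.
MicroKernelParams computeMicroKernelParams(int Nvec, int LatencyVectorFma,
                                           int ThroughputVectorFma) {
  assert(Nvec > 0 && LatencyVectorFma > 0 && ThroughputVectorFma > 0);
  // Number of independent scalar FMA results needed to hide the FMA latency.
  double Product =
      static_cast<double>(Nvec) * LatencyVectorFma * ThroughputVectorFma;
  // Nr is the square root of that number, rounded up to a multiple of Nvec.
  int Nr = static_cast<int>(std::ceil(std::sqrt(Product) / Nvec)) * Nvec;
  // Mr covers the remaining factor, rounded up.
  int Mr = static_cast<int>(std::ceil(Product / Nr));
  return {Mr, Nr};
}
```

With Nvec = 4, a latency of 8 and a throughput of 1, this gives Mr = 4 and Nr = 8, which matches the 4 x 8 register tile in the test case added by this patch. With Nvec = 2 and the same latency and throughput it gives Mr = 4 and Nr = 4.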

I've also found that we get compile-time errors in SingleSource/Benchmarks/Polybench/linear-algebra/solvers/lu/lu and SingleSource/Benchmarks/Polybench/stencils/adi/adi, caused by the following failed assertion:

assert(isl_map_dim(NewPartialSchedule, isl_dim_out) == 3 && "Each schedule dimension should be represented by a union piecewise quasi-affine expression.");

Sorry that I didn't test it on a debug build. I'll try to fix the issue soon.

Closed by commit rL273397: Apply all necessary tilings and unrollings to get a micro-kernel (authored by romangareev). Jun 22 2016, 2:59 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
include/polly/ScheduleOptimizer.h	35 lines
lib/Transform/ScheduleOptimizer.cpp	55 lines
test/ScheduleOptimizer/pattern-matching-based-opts_3.ll	128 lines

Diff 60470

include/polly/ScheduleOptimizer.h

//===------ polly/ScheduleOptimizer.h - The Schedule Optimizer - C++ --===//		//===------ polly/ScheduleOptimizer.h - The Schedule Optimizer - C++ --===//
//		//
// The LLVM Compiler Infrastructure		// The LLVM Compiler Infrastructure
//		//
// This file is distributed under the University of Illinois Open Source		// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.		// License. See LICENSE.TXT for details.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#ifndef POLLY_SCHEDULE_OPTIMIZER_H		#ifndef POLLY_SCHEDULE_OPTIMIZER_H
#define POLLY_SCHEDULE_OPTIMIZER_H		#define POLLY_SCHEDULE_OPTIMIZER_H

#include "llvm/ADT/ArrayRef.h"		#include "llvm/ADT/ArrayRef.h"
		#include "llvm/Analysis/TargetTransformInfo.h"
#include "isl/ctx.h"		#include "isl/ctx.h"

struct isl_schedule;		struct isl_schedule;
struct isl_schedule_node;		struct isl_schedule_node;
struct isl_union_map;		struct isl_union_map;

namespace polly {		namespace polly {
extern bool DisablePollyTiling;		extern bool DisablePollyTiling;
class Scop;		class Scop;
}		}

class ScheduleTreeOptimizer {		class ScheduleTreeOptimizer {
public:		public:
/// @brief Apply schedule tree transformations.		/// @brief Apply schedule tree transformations.
///		///
/// This function takes an (possibly already optimized) schedule tree and		/// This function takes an (possibly already optimized) schedule tree and
/// applies a set of additional optimizations on the schedule tree. The		/// applies a set of additional optimizations on the schedule tree. The
/// transformations applied include:		/// transformations applied include:
///		///
/// - Tiling		/// - Tiling
/// - Prevectorization		/// - Prevectorization
///		///
/// @param Schedule The schedule object the transformations will be applied		/// @param Schedule The schedule object the transformations will be applied
/// to.		/// to.
		/// @param TTI Target Transform Info.
/// @returns The transformed schedule.		/// @returns The transformed schedule.
static __isl_give isl_schedule *		static __isl_give isl_schedule *
optimizeSchedule(__isl_take isl_schedule *Schedule);		optimizeSchedule(__isl_take isl_schedule *Schedule,
		const llvm::TargetTransformInfo *TTI = nullptr);

/// @brief Apply schedule tree transformations.		/// @brief Apply schedule tree transformations.
///		///
/// This function takes a node in an (possibly already optimized) schedule		/// This function takes a node in an (possibly already optimized) schedule
/// tree and applies a set of additional optimizations on this schedule tree		/// tree and applies a set of additional optimizations on this schedule tree
/// node and its descendents. The transformations applied include:		/// node and its descendents. The transformations applied include:
///		///
/// - Tiling		/// - Tiling
/// - Prevectorization		/// - Prevectorization
///		///
/// @param Node The schedule object post-transformations will be applied to.		/// @param Node The schedule object post-transformations will be applied to.
		/// @param TTI Target Transform Info.
/// @returns The transformed schedule.		/// @returns The transformed schedule.
static __isl_give isl_schedule_node *		static __isl_give isl_schedule_node *
optimizeScheduleNode(__isl_take isl_schedule_node *Node);		optimizeScheduleNode(__isl_take isl_schedule_node *Node,
		const llvm::TargetTransformInfo *TTI = nullptr);

/// @brief Decide if the @p NewSchedule is profitable for @p S.		/// @brief Decide if the @p NewSchedule is profitable for @p S.
///		///
/// @param S The SCoP we optimize.		/// @param S The SCoP we optimize.
/// @param NewSchedule The new schedule we computed.		/// @param NewSchedule The new schedule we computed.
///		///
/// @return True, if we believe @p NewSchedule is an improvement for @p S.		/// @return True, if we believe @p NewSchedule is an improvement for @p S.
static bool isProfitableSchedule(polly::Scop &S,		static bool isProfitableSchedule(polly::Scop &S,
Show All 28 Lines	private:
/// @brief Tile a schedule node and unroll point loops.		/// @brief Tile a schedule node and unroll point loops.
///		///
/// @param Node The node to register tile.		/// @param Node The node to register tile.
/// @param TileSizes A vector of tile sizes that should be used for		/// @param TileSizes A vector of tile sizes that should be used for
/// tiling.		/// tiling.
/// @param DefaultTileSize A default tile size that is used for dimensions		/// @param DefaultTileSize A default tile size that is used for dimensions
static __isl_give isl_schedule_node *		static __isl_give isl_schedule_node *
applyRegisterTiling(__isl_take isl_schedule_node *Node,		applyRegisterTiling(__isl_take isl_schedule_node *Node,
llvm::ArrayRef<int> TileSizes, int DefaultTileSize);		llvm::ArrayRef<int> TileSizes, int DefaultTileSize);

		/// @brief Apply the BLIS matmul optimization pattern
		///
		/// Apply the BLIS matmul optimization pattern
		/// (http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf).
		/// BLIS implements gemm as three nested loops around a macro-kernel,
		/// plus two packing routines. The macro-kernel is implemented in terms
		/// of two additional loops around a micro-kernel. The micro-kernel
		/// is a loop around a rank-1 (i.e., outer product) update.
		///
		/// We create the BLIS micro-kernel by applying a combination of tiling
		/// and unrolling. In subsequent changes we will add the extraction
		/// of the BLIS macro-kernel and implement the packing transformation.
		///
		/// It is assumed that the Node is successfully checked
		/// by ScheduleTreeOptimizer::isMatrMultPattern. Consequently
		/// in case of matmul kernels the application of optimizeMatMulPattern
		/// can lead to close-to-peak performance. Maybe it can be generalized
		/// to effectively optimize the whole class of successfully checked
		/// statements.
		///
		/// @param Node the node that contains a band to be optimized.
		/// @return Modified isl_schedule_node.
		static __isl_give isl_schedule_node *
		optimizeMatMulPattern(__isl_take isl_schedule_node *Node,
		const llvm::TargetTransformInfo *TTI);

/// @brief Check if this node is a band node we want to tile.		/// @brief Check if this node is a band node we want to tile.
///		///
/// We look for innermost band nodes where individual dimensions are marked as		/// We look for innermost band nodes where individual dimensions are marked as
/// permutable.		/// permutable.
///		///
/// @param Node The node to check.		/// @param Node The node to check.
static bool isTileableBandNode(__isl_keep isl_schedule_node *Node);		static bool isTileableBandNode(__isl_keep isl_schedule_node *Node);


lib/Transform/ScheduleOptimizer.cpp


#include "polly/ScheduleOptimizer.h"		#include "polly/ScheduleOptimizer.h"
#include "polly/CodeGen/CodeGeneration.h"		#include "polly/CodeGen/CodeGeneration.h"
#include "polly/DependenceInfo.h"		#include "polly/DependenceInfo.h"
#include "polly/LinkAllPasses.h"		#include "polly/LinkAllPasses.h"
#include "polly/Options.h"		#include "polly/Options.h"
#include "polly/ScopInfo.h"		#include "polly/ScopInfo.h"
#include "polly/Support/GICHelper.h"		#include "polly/Support/GICHelper.h"
		#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/Support/Debug.h"		#include "llvm/Support/Debug.h"
#include "isl/aff.h"		#include "isl/aff.h"
#include "isl/band.h"		#include "isl/band.h"
#include "isl/constraint.h"		#include "isl/constraint.h"
#include "isl/map.h"		#include "isl/map.h"
#include "isl/options.h"		#include "isl/options.h"
#include "isl/printer.h"		#include "isl/printer.h"
#include "isl/schedule.h"		#include "isl/schedule.h"
▲ Show 20 Lines • Show All 50 Lines • ▼ Show 20 Lines	cl::desc(
"The number of loop iterations to strip-mine for pre-vectorization"),		"The number of loop iterations to strip-mine for pre-vectorization"),
cl::Hidden, cl::init(4), cl::ZeroOrMore, cl::cat(PollyCategory));		cl::Hidden, cl::init(4), cl::ZeroOrMore, cl::cat(PollyCategory));

static cl::opt<bool> FirstLevelTiling("polly-tiling",		static cl::opt<bool> FirstLevelTiling("polly-tiling",
cl::desc("Enable loop tiling"),		cl::desc("Enable loop tiling"),
cl::init(true), cl::ZeroOrMore,		cl::init(true), cl::ZeroOrMore,
cl::cat(PollyCategory));		cl::cat(PollyCategory));

		static cl::opt<int> LatencyVectorFma(
		"polly-target-latency-vector-fma",
		cl::desc("The minimal number of cycles between issuing two "
		"dependent consecutive vector fused multiply-add "
		"instructions."),
		cl::Hidden, cl::init(8), cl::ZeroOrMore, cl::cat(PollyCategory));

		static cl::opt<int> ThrougputVectorFma(
		"polly-target-througput-vector-fma",
		cl::desc("A throughput of the processor floating-point arithmetic units "
		"expressed in the number of vector fused multiply-add "
		"instructions per clock cycle."),
		cl::Hidden, cl::init(1), cl::ZeroOrMore, cl::cat(PollyCategory));

static cl::opt<int> FirstLevelDefaultTileSize(		static cl::opt<int> FirstLevelDefaultTileSize(
"polly-default-tile-size",		"polly-default-tile-size",
cl::desc("The default tile size (if not enough were provided by"		cl::desc("The default tile size (if not enough were provided by"
" --polly-tile-sizes)"),		" --polly-tile-sizes)"),
cl::Hidden, cl::init(32), cl::ZeroOrMore, cl::cat(PollyCategory));		cl::Hidden, cl::init(32), cl::ZeroOrMore, cl::cat(PollyCategory));

static cl::list<int> FirstLevelTileSizes(		static cl::list<int> FirstLevelTileSizes(
"polly-tile-sizes", cl::desc("A tile size for each loop dimension, filled "		"polly-tile-sizes", cl::desc("A tile size for each loop dimension, filled "
▲ Show 20 Lines • Show All 209 Lines • ▼ Show 20 Lines
ScheduleTreeOptimizer::applyRegisterTiling(__isl_take isl_schedule_node *Node,		ScheduleTreeOptimizer::applyRegisterTiling(__isl_take isl_schedule_node *Node,
llvm::ArrayRef<int> TileSizes,		llvm::ArrayRef<int> TileSizes,
int DefaultTileSize) {		int DefaultTileSize) {
auto *Ctx = isl_schedule_node_get_ctx(Node);		auto *Ctx = isl_schedule_node_get_ctx(Node);
Node = tileNode(Node, "Register tiling", TileSizes, DefaultTileSize);		Node = tileNode(Node, "Register tiling", TileSizes, DefaultTileSize);
Node = isl_schedule_node_band_set_ast_build_options(		Node = isl_schedule_node_band_set_ast_build_options(
Node, isl_union_set_read_from_str(Ctx, "{unroll[x]}"));		Node, isl_union_set_read_from_str(Ctx, "{unroll[x]}"));
return Node;		return Node;
}		}

bool ScheduleTreeOptimizer::isTileableBandNode(		bool ScheduleTreeOptimizer::isTileableBandNode(
__isl_keep isl_schedule_node *Node) {		__isl_keep isl_schedule_node *Node) {
if (isl_schedule_node_get_type(Node) != isl_schedule_node_band)		if (isl_schedule_node_get_type(Node) != isl_schedule_node_band)
return false;		return false;

if (isl_schedule_node_n_children(Node) != 1)		if (isl_schedule_node_n_children(Node) != 1)
return false;		return false;
Show All 25 Lines	if (FirstLevelTiling)
Node = tileNode(Node, "1st level tiling", FirstLevelTileSizes,		Node = tileNode(Node, "1st level tiling", FirstLevelTileSizes,
FirstLevelDefaultTileSize);		FirstLevelDefaultTileSize);

if (SecondLevelTiling)		if (SecondLevelTiling)
Node = tileNode(Node, "2nd level tiling", SecondLevelTileSizes,		Node = tileNode(Node, "2nd level tiling", SecondLevelTileSizes,
SecondLevelDefaultTileSize);		SecondLevelDefaultTileSize);

if (RegisterTiling)		if (RegisterTiling)
Node =		Node =
applyRegisterTiling(Node, RegisterTileSizes, RegisterDefaultTileSize);		applyRegisterTiling(Node, RegisterTileSizes, RegisterDefaultTileSize);

if (PollyVectorizerChoice == VECTORIZER_NONE)		if (PollyVectorizerChoice == VECTORIZER_NONE)
return Node;		return Node;

auto Space = isl_schedule_node_band_get_space(Node);		auto Space = isl_schedule_node_band_get_space(Node);
auto Dims = isl_space_dim(Space, isl_dim_set);		auto Dims = isl_space_dim(Space, isl_dim_set);
isl_space_free(Space);		isl_space_free(Space);
▲ Show 20 Lines • Show All 75 Lines • ▼ Show 20 Lines	static __isl_give isl_map *circularShiftOutputDims(__isl_take isl_map *IslMap) {
if (DimNum == 0)		if (DimNum == 0)
return IslMap;		return IslMap;
auto InputDimsId = isl_map_get_tuple_id(IslMap, isl_dim_in);		auto InputDimsId = isl_map_get_tuple_id(IslMap, isl_dim_in);
IslMap = isl_map_move_dims(IslMap, isl_dim_in, 0, isl_dim_out, DimNum - 1, 1);		IslMap = isl_map_move_dims(IslMap, isl_dim_in, 0, isl_dim_out, DimNum - 1, 1);
IslMap = isl_map_move_dims(IslMap, isl_dim_out, 0, isl_dim_in, 0, 1);		IslMap = isl_map_move_dims(IslMap, isl_dim_out, 0, isl_dim_in, 0, 1);
return isl_map_set_tuple_id(IslMap, isl_dim_in, InputDimsId);		return isl_map_set_tuple_id(IslMap, isl_dim_in, InputDimsId);
}		}

		__isl_give isl_schedule_node *ScheduleTreeOptimizer::optimizeMatMulPattern(
		__isl_take isl_schedule_node *Node, const llvm::TargetTransformInfo *TTI) {
		assert(TTI && "The target transform info should be provided.");
		// Get a micro-kernel.
		// Nvec - Number of double-precision floating-point numbers that can be held
		// by a vector register. Use 2 by default.
		auto Nvec = TTI->getRegisterBitWidth(true) / 64;
		if (Nvec == 0)
		Nvec = 2;
		int Nr =
		ceil(sqrt(Nvec * LatencyVectorFma * ThrougputVectorFma) / Nvec) * Nvec;
		int Mr = ceil(Nvec * LatencyVectorFma * ThrougputVectorFma / Nr);
		std::vector<int> MicroKernelParams{Mr, Nr};
		Node = applyRegisterTiling(Node, MicroKernelParams, 1);
		return Node;
		}

bool ScheduleTreeOptimizer::isMatrMultPattern(		bool ScheduleTreeOptimizer::isMatrMultPattern(
__isl_keep isl_schedule_node *Node) {		__isl_keep isl_schedule_node *Node) {
auto *PartialSchedule =		auto *PartialSchedule =
isl_schedule_node_band_get_partial_schedule_union_map(Node);		isl_schedule_node_band_get_partial_schedule_union_map(Node);
if (isl_union_map_n_map(PartialSchedule) != 1)		if (isl_union_map_n_map(PartialSchedule) != 1)
return false;		return false;
auto *NewPartialSchedule = isl_map_from_union_map(PartialSchedule);		auto *NewPartialSchedule = isl_map_from_union_map(PartialSchedule);
auto DimNum = isl_map_dim(NewPartialSchedule, isl_dim_in);		auto DimNum = isl_map_dim(NewPartialSchedule, isl_dim_in);
Show All 14 Lines
}		}

__isl_give isl_schedule_node *		__isl_give isl_schedule_node *
ScheduleTreeOptimizer::optimizeBand(__isl_take isl_schedule_node *Node,		ScheduleTreeOptimizer::optimizeBand(__isl_take isl_schedule_node *Node,
void *User) {		void *User) {
if (!isTileableBandNode(Node))		if (!isTileableBandNode(Node))
return Node;		return Node;

if (PMBasedOpts && isMatrMultPattern(Node))		if (PMBasedOpts && User && isMatrMultPattern(Node)) {
DEBUG(dbgs() << "The matrix multiplication pattern was detected\n");		DEBUG(dbgs() << "The matrix multiplication pattern was detected\n");
		const llvm::TargetTransformInfo *TTI;
		TTI = static_cast<const llvm::TargetTransformInfo *>(User);
		Node = optimizeMatMulPattern(Node, TTI);
		}

return standardBandOpts(Node, User);		return standardBandOpts(Node, User);
}		}

__isl_give isl_schedule *		__isl_give isl_schedule *
ScheduleTreeOptimizer::optimizeSchedule(__isl_take isl_schedule *Schedule) {		ScheduleTreeOptimizer::optimizeSchedule(__isl_take isl_schedule *Schedule,
		const llvm::TargetTransformInfo *TTI) {
isl_schedule_node *Root = isl_schedule_get_root(Schedule);		isl_schedule_node *Root = isl_schedule_get_root(Schedule);
Root = optimizeScheduleNode(Root);		Root = optimizeScheduleNode(Root, TTI);
isl_schedule_free(Schedule);		isl_schedule_free(Schedule);
auto S = isl_schedule_node_get_schedule(Root);		auto S = isl_schedule_node_get_schedule(Root);
isl_schedule_node_free(Root);		isl_schedule_node_free(Root);
return S;		return S;
}		}

__isl_give isl_schedule_node *ScheduleTreeOptimizer::optimizeScheduleNode(		__isl_give isl_schedule_node *ScheduleTreeOptimizer::optimizeScheduleNode(
__isl_take isl_schedule_node *Node) {		__isl_take isl_schedule_node *Node, const llvm::TargetTransformInfo *TTI) {
Node = isl_schedule_node_map_descendant_bottom_up(Node, optimizeBand, NULL);		Node = isl_schedule_node_map_descendant_bottom_up(
		Node, optimizeBand, const_cast<void *>(static_cast<const void *>(TTI)));
return Node;		return Node;
}		}

bool ScheduleTreeOptimizer::isProfitableSchedule(		bool ScheduleTreeOptimizer::isProfitableSchedule(
Scop &S, __isl_keep isl_union_map *NewSchedule) {		Scop &S, __isl_keep isl_union_map *NewSchedule) {
// To understand if the schedule has been optimized we check if the schedule		// To understand if the schedule has been optimized we check if the schedule
// has changed at all.		// has changed at all.
// TODO: We can improve this by tracking if any necessarily beneficial		// TODO: We can improve this by tracking if any necessarily beneficial
▲ Show 20 Lines • Show All 171 Lines • ▼ Show 20 Lines	bool IslScheduleOptimizer::runOnScop(Scop &S) {
DEBUG({		DEBUG({
auto *P = isl_printer_to_str(S.getIslCtx());		auto *P = isl_printer_to_str(S.getIslCtx());
P = isl_printer_set_yaml_style(P, ISL_YAML_STYLE_BLOCK);		P = isl_printer_set_yaml_style(P, ISL_YAML_STYLE_BLOCK);
P = isl_printer_print_schedule(P, Schedule);		P = isl_printer_print_schedule(P, Schedule);
dbgs() << "NewScheduleTree: \n" << isl_printer_get_str(P) << "\n";		dbgs() << "NewScheduleTree: \n" << isl_printer_get_str(P) << "\n";
isl_printer_free(P);		isl_printer_free(P);
});		});

isl_schedule *NewSchedule = ScheduleTreeOptimizer::optimizeSchedule(Schedule);		Function &F = S.getFunction();
		auto *TTI = &getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);
		isl_schedule *NewSchedule =
		ScheduleTreeOptimizer::optimizeSchedule(Schedule, TTI);
isl_union_map *NewScheduleMap = isl_schedule_get_map(NewSchedule);		isl_union_map *NewScheduleMap = isl_schedule_get_map(NewSchedule);

if (!ScheduleTreeOptimizer::isProfitableSchedule(S, NewScheduleMap)) {		if (!ScheduleTreeOptimizer::isProfitableSchedule(S, NewScheduleMap)) {
isl_union_map_free(NewScheduleMap);		isl_union_map_free(NewScheduleMap);
isl_schedule_free(NewSchedule);		isl_schedule_free(NewSchedule);
return false;		return false;
}		}

Show All 21 Lines	void IslScheduleOptimizer::printScop(raw_ostream &OS, Scop &) const {
isl_printer_free(p);		isl_printer_free(p);

OS << ScheduleStr << "\n";		OS << ScheduleStr << "\n";
}		}

void IslScheduleOptimizer::getAnalysisUsage(AnalysisUsage &AU) const {		void IslScheduleOptimizer::getAnalysisUsage(AnalysisUsage &AU) const {
ScopPass::getAnalysisUsage(AU);		ScopPass::getAnalysisUsage(AU);
AU.addRequired<DependenceInfo>();		AU.addRequired<DependenceInfo>();
		AU.addRequired<TargetTransformInfoWrapperPass>();
}		}

Pass *polly::createIslScheduleOptimizerPass() {		Pass *polly::createIslScheduleOptimizerPass() {
return new IslScheduleOptimizer();		return new IslScheduleOptimizer();
}		}

INITIALIZE_PASS_BEGIN(IslScheduleOptimizer, "polly-opt-isl",		INITIALIZE_PASS_BEGIN(IslScheduleOptimizer, "polly-opt-isl",
"Polly - Optimize schedule of SCoP", false, false);		"Polly - Optimize schedule of SCoP", false, false);
INITIALIZE_PASS_DEPENDENCY(DependenceInfo);		INITIALIZE_PASS_DEPENDENCY(DependenceInfo);
INITIALIZE_PASS_DEPENDENCY(ScopInfoRegionPass);		INITIALIZE_PASS_DEPENDENCY(ScopInfoRegionPass);
		INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass);
INITIALIZE_PASS_END(IslScheduleOptimizer, "polly-opt-isl",		INITIALIZE_PASS_END(IslScheduleOptimizer, "polly-opt-isl",
"Polly - Optimize schedule of SCoP", false, false)		"Polly - Optimize schedule of SCoP", false, false)

test/ScheduleOptimizer/pattern-matching-based-opts_3.ll

This file was added.

				; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true -polly-target-througput-vector-fma=1 -polly-target-latency-vector-fma=8 -analyze -polly-ast < %s 2>&1 | FileCheck %s
				;
				; /* C := alpha*A*B + beta*C */
				; for (i = 0; i < _PB_NI; i++)
				; for (j = 0; j < _PB_NJ; j++)
				; {
				; C[i][j] *= beta;
				; for (k = 0; k < _PB_NK; ++k)
				; C[i][j] += alpha * A[i][k] * B[k][j];
				; }
				;
				; CHECK: {
				; CHECK: // 1st level tiling - Tiles
				; CHECK: for (int c0 = 0; c0 <= 32; c0 += 1)
				; CHECK: for (int c1 = 0; c1 <= 32; c1 += 1) {
				; CHECK: // 1st level tiling - Points
				; CHECK: for (int c2 = 0; c2 <= 31; c2 += 1)
				; CHECK: for (int c3 = 0; c3 <= 31; c3 += 1)
				; CHECK: Stmt_bb14(32 * c0 + c2, 32 * c1 + c3);
				; CHECK: }
				; CHECK: // Register tiling - Tiles
				; CHECK: for (int c0 = 0; c0 <= 263; c0 += 1)
				; CHECK: for (int c1 = 0; c1 <= 131; c1 += 1)
				; CHECK: for (int c2 = 0; c2 <= 1023; c2 += 1) {
				; CHECK: // Register tiling - Points
				; CHECK: // 1st level tiling - Tiles
				; CHECK: // 1st level tiling - Points
				; CHECK: {
				; CHECK: Stmt_bb24(4 * c0, 8 * c1, c2);
				; CHECK: Stmt_bb24(4 * c0, 8 * c1 + 1, c2);
				; CHECK: Stmt_bb24(4 * c0, 8 * c1 + 2, c2);
				; CHECK: Stmt_bb24(4 * c0, 8 * c1 + 3, c2);
				; CHECK: Stmt_bb24(4 * c0, 8 * c1 + 4, c2);
				; CHECK: Stmt_bb24(4 * c0, 8 * c1 + 5, c2);
				; CHECK: Stmt_bb24(4 * c0, 8 * c1 + 6, c2);
				; CHECK: Stmt_bb24(4 * c0, 8 * c1 + 7, c2);
				; CHECK: Stmt_bb24(4 * c0 + 1, 8 * c1, c2);
				; CHECK: Stmt_bb24(4 * c0 + 1, 8 * c1 + 1, c2);
				; CHECK: Stmt_bb24(4 * c0 + 1, 8 * c1 + 2, c2);
				; CHECK: Stmt_bb24(4 * c0 + 1, 8 * c1 + 3, c2);
				; CHECK: Stmt_bb24(4 * c0 + 1, 8 * c1 + 4, c2);
				; CHECK: Stmt_bb24(4 * c0 + 1, 8 * c1 + 5, c2);
				; CHECK: Stmt_bb24(4 * c0 + 1, 8 * c1 + 6, c2);
				; CHECK: Stmt_bb24(4 * c0 + 1, 8 * c1 + 7, c2);
				; CHECK: Stmt_bb24(4 * c0 + 2, 8 * c1, c2);
				; CHECK: Stmt_bb24(4 * c0 + 2, 8 * c1 + 1, c2);
				; CHECK: Stmt_bb24(4 * c0 + 2, 8 * c1 + 2, c2);
				; CHECK: Stmt_bb24(4 * c0 + 2, 8 * c1 + 3, c2);
				; CHECK: Stmt_bb24(4 * c0 + 2, 8 * c1 + 4, c2);
				; CHECK: Stmt_bb24(4 * c0 + 2, 8 * c1 + 5, c2);
				; CHECK: Stmt_bb24(4 * c0 + 2, 8 * c1 + 6, c2);
				; CHECK: Stmt_bb24(4 * c0 + 2, 8 * c1 + 7, c2);
				; CHECK: Stmt_bb24(4 * c0 + 3, 8 * c1, c2);
				; CHECK: Stmt_bb24(4 * c0 + 3, 8 * c1 + 1, c2);
				; CHECK: Stmt_bb24(4 * c0 + 3, 8 * c1 + 2, c2);
				; CHECK: Stmt_bb24(4 * c0 + 3, 8 * c1 + 3, c2);
				; CHECK: Stmt_bb24(4 * c0 + 3, 8 * c1 + 4, c2);
				; CHECK: Stmt_bb24(4 * c0 + 3, 8 * c1 + 5, c2);
				; CHECK: Stmt_bb24(4 * c0 + 3, 8 * c1 + 6, c2);
				; CHECK: Stmt_bb24(4 * c0 + 3, 8 * c1 + 7, c2);
				; CHECK: }
				; CHECK: }
				; CHECK: }
				;
				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-unknown"

				define internal void @kernel_gemm(i32 %arg, i32 %arg1, i32 %arg2, double %arg3, double %arg4, [1056 x double]* %arg5, [1024 x double]* %arg6, [1056 x double]* %arg7) #0 {
				bb:
				br label %bb8

				bb8: ; preds = %bb39, %bb
				%tmp = phi i32 [ 0, %bb ], [ %tmp40, %bb39 ]
				%tmp9 = icmp slt i32 %tmp, 1056
				br i1 %tmp9, label %bb10, label %bb41

				bb10: ; preds = %bb8
				br label %bb11

				bb11: ; preds = %bb37, %bb10
				%tmp12 = phi i32 [ 0, %bb10 ], [ %tmp38, %bb37 ]
				%tmp13 = icmp slt i32 %tmp12, 1056
				br i1 %tmp13, label %bb14, label %bb39

				bb14: ; preds = %bb11
				%tmp15 = sext i32 %tmp12 to i64
				%tmp16 = sext i32 %tmp to i64
				%tmp17 = getelementptr inbounds [1056 x double], [1056 x double]* %arg5, i64 %tmp16
				%tmp18 = getelementptr inbounds [1056 x double], [1056 x double]* %tmp17, i64 0, i64 %tmp15
				%tmp19 = load double, double* %tmp18, align 8
				%tmp20 = fmul double %tmp19, %arg4
				store double %tmp20, double* %tmp18, align 8
				br label %bb21

				bb21: ; preds = %bb24, %bb14
				%tmp22 = phi i32 [ 0, %bb14 ], [ %tmp36, %bb24 ]
				%tmp23 = icmp slt i32 %tmp22, 1024
				br i1 %tmp23, label %bb24, label %bb37

				bb24: ; preds = %bb21
				%tmp25 = sext i32 %tmp22 to i64
				%tmp26 = getelementptr inbounds [1024 x double], [1024 x double]* %arg6, i64 %tmp16
				%tmp27 = getelementptr inbounds [1024 x double], [1024 x double]* %tmp26, i64 0, i64 %tmp25
				%tmp28 = load double, double* %tmp27, align 8
				%tmp29 = fmul double %arg3, %tmp28
				%tmp30 = getelementptr inbounds [1056 x double], [1056 x double]* %arg7, i64 %tmp25
				%tmp31 = getelementptr inbounds [1056 x double], [1056 x double]* %tmp30, i64 0, i64 %tmp15
				%tmp32 = load double, double* %tmp31, align 8
				%tmp33 = fmul double %tmp29, %tmp32
				%tmp34 = load double, double* %tmp18, align 8
				%tmp35 = fadd double %tmp34, %tmp33
				store double %tmp35, double* %tmp18, align 8
				%tmp36 = add nsw i32 %tmp22, 1
				br label %bb21

				bb37: ; preds = %bb21
				%tmp38 = add nsw i32 %tmp12, 1
				br label %bb11

				bb39: ; preds = %bb11
				%tmp40 = add nsw i32 %tmp, 1
				br label %bb8

				bb41: ; preds = %bb8
				ret void
				}

				attributes #0 = { nounwind uwtable "target-cpu"="x86-64" "target-features"="+aes,+avx,+cmov,+cx16,+fxsr,+mmx,+pclmul,+popcnt,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+x87,+xsave,+xsaveopt" }
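The loop bounds in the CHECK lines of the test above follow from the array extents in the IR (1056 for i and j) and the 4 x 8 register tile. As a hedged illustration, the following C++ sketch walks the same tile and point structure and counts how often each (i, j) point is visited. The helper `visitCounts` is hypothetical and only models the iteration space; it is not generated by Polly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the patch): counts how often each (i, j)
// point of an NI x NJ iteration space is visited by a loop nest that is
// tiled with Mr x Nr tiles. NI and NJ are assumed to be multiples of Mr and
// Nr, as they are in the test case above.
std::vector<int> visitCounts(int NI, int NJ, int Mr, int Nr) {
  assert(Mr > 0 && Nr > 0 && NI % Mr == 0 && NJ % Nr == 0);
  std::vector<int> Count(static_cast<std::size_t>(NI) * NJ, 0);
  for (int c0 = 0; c0 <= NI / Mr - 1; ++c0)    // register tiling, tile loop i
    for (int c1 = 0; c1 <= NJ / Nr - 1; ++c1)  // register tiling, tile loop j
      for (int p0 = 0; p0 < Mr; ++p0)          // point loop, fully unrolled
        for (int p1 = 0; p1 < Nr; ++p1)        // point loop, fully unrolled
          ++Count[static_cast<std::size_t>(Mr * c0 + p0) * NJ +
                  (Nr * c1 + p1)];
  return Count;
}
```

For NI = NJ = 1056, Mr = 4 and Nr = 8, the tile loops run up to 1056 / 4 - 1 = 263 and 1056 / 8 - 1 = 131, matching the c0 and c1 bounds in the CHECK lines, and every point is visited exactly once. The 32 statement instances per (c0, c1, c2) iteration in the expected output correspond to the 4 * 8 points of one tile.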