This is an archive of the discontinued LLVM Phabricator instance.

[Polly] Disable matmul pattern-match + -polly-parallel
Needs Revision · Public

Authored by bollu on May 8 2022, 8:42 PM.

Details

Summary

The matmul pattern match and -polly-parallel interact poorly and lead to miscompiles
(see: https://groups.google.com/g/polly-dev/c/gYyPbg5yWOs/m/dPyhM_w-AwAJ).
We therefore disable the matrix-multiplication pattern match when -polly-parallel
is supplied.
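A minimal sketch of the kind of gating the summary describes; the option variable names below are assumptions for illustration, not necessarily the patch's actual code:

  #include "llvm/Support/CommandLine.h"

  // Sketch only, not the actual patch. The real cl::opt definitions live in
  // Polly's sources; the variable names here are assumed.
  static llvm::cl::opt<bool> PollyParallel(
      "polly-parallel", llvm::cl::desc("Generate thread-parallel code"),
      llvm::cl::init(false));
  static llvm::cl::opt<bool> PMBasedOpts(
      "polly-pattern-matching-based-opts",
      llvm::cl::desc("Perform optimizations based on pattern matching"),
      llvm::cl::init(true));

  // Only try the matmul pattern match when -polly-parallel is not in effect,
  // since the combination is known to miscompile (see the polly-dev thread).
  static bool shouldTryMatMulPatternMatch() {
    return PMBasedOpts && !PollyParallel;
  }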

Diff Detail

Event Timeline

bollu created this revision. May 8 2022, 8:42 PM
Herald added a project: Restricted Project. · View Herald Transcript May 8 2022, 8:42 PM
Herald added a subscriber: arjunp. · View Herald Transcript
bollu requested review of this revision. May 8 2022, 8:42 PM
Herald added a project: Restricted Project. · View Herald Transcript May 8 2022, 8:42 PM
grosser accepted this revision. May 8 2022, 8:55 PM

This looks good from my side. However, maybe Michael Kruse has further remarks?

This revision is now accepted and ready to land. May 8 2022, 8:55 PM
Meinersbur requested changes to this revision. May 9 2022, 7:10 PM

This fixes the problem:

diff --git a/polly/lib/Transform/ScheduleOptimizer.cpp b/polly/lib/Transform/ScheduleOptimizer.cpp
index fad62d9b20830..f2cf06af2f7c8 100644
--- a/polly/lib/Transform/ScheduleOptimizer.cpp
+++ b/polly/lib/Transform/ScheduleOptimizer.cpp
@@ -978,7 +978,7 @@ runIslScheduleOptimizerUsingNPM(Scop &S, ScopAnalysisManager &SAM,
   OptimizationRemarkEmitter ORE(&S.getFunction());
   TargetTransformInfo *TTI = &SAR.TTI;
   isl::schedule LastSchedule;
-  bool Modified = runIslScheduleOptimizer(S, GetDeps, TTI, &ORE, LastSchedule);
+  runIslScheduleOptimizer(S, GetDeps, TTI, &ORE, LastSchedule);
   if (OS) {
     *OS << "Printing analysis 'Polly - Optimize schedule of SCoP' for region: '"
         << S.getName() << "' in function '" << S.getFunction().getName()
@@ -986,13 +986,11 @@ runIslScheduleOptimizerUsingNPM(Scop &S, ScopAnalysisManager &SAM,
     runScheduleOptimizerPrinter(*OS, LastSchedule);
   }

-  if (!Modified)
-    return PreservedAnalyses::all();
-
   PreservedAnalyses PA;
   PA.preserveSet<AllAnalysesOn<Module>>();
   PA.preserveSet<AllAnalysesOn<Function>>();
   PA.preserveSet<AllAnalysesOn<Loop>>();
+  PA.abandon<DependenceAnalysis>();
   return PA;
 }

The optimization adds new statements without updating (or invalidating) DependenceInfo. Unfortunately, this now means that DependenceAnalysis will run a second time for AstInfo. Also consider adding a comment on why DependenceAnalysis is abandoned here and in the legacy pass manager. Maybe you can find a way to only invalidate it when the matmul optimization has actually been applied.
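A minimal sketch of that last suggestion, assuming runIslScheduleOptimizer could be extended with an out-parameter (called DepsChanged here purely for illustration) that reports whether the matmul transformation rewrote the SCoP:

  // Sketch only: abandon DependenceAnalysis just when the matmul optimization
  // actually added new statements, and keep all analyses otherwise.
  bool DepsChanged = false;
  runIslScheduleOptimizer(S, GetDeps, TTI, &ORE, LastSchedule, DepsChanged);
  ...
  PreservedAnalyses PA;
  PA.preserveSet<AllAnalysesOn<Module>>();
  PA.preserveSet<AllAnalysesOn<Function>>();
  PA.preserveSet<AllAnalysesOn<Loop>>();
  if (DepsChanged)
    PA.abandon<DependenceAnalysis>();
  return PA;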

We should also discuss the following:

diff --git a/polly/lib/Transform/MatmulOptimizer.cpp b/polly/lib/Transform/MatmulOptimizer.cpp
index 60dd9eda3c2c0..c46025522bc23 100644
--- a/polly/lib/Transform/MatmulOptimizer.cpp
+++ b/polly/lib/Transform/MatmulOptimizer.cpp
@@ -491,9 +491,6 @@ createMacroKernel(isl::schedule_node Node,
   Node = permuteBandNodeDimensions(Node, DimOutNum - 2, DimOutNum - 1);
   Node = permuteBandNodeDimensions(Node, DimOutNum - 3, DimOutNum - 1);

-  // Mark the outermost loop as parallelizable.
-  Node = Node.as<isl::schedule_node_band>().member_set_coincident(0, true);
-
   return Node.child(0).child(0);
 }

I added this in aa8a976174c7ac08676bbc7bb647f6bc0efd2e72 and I think it does not actually make anything parallel, but I am not sure it is actually allowed, because Packed_A is shared between all the threads.

This revision now requires changes to proceed. May 9 2022, 7:11 PM
grosser requested changes to this revision. May 9 2022, 9:29 PM

This is excellent. Thank you Michael!

gareevroman added a subscriber: gareevroman. Edited May 15 2022, 11:16 PM

> I added this in aa8a976174c7ac08676bbc7bb647f6bc0efd2e72 and I think it does not actually make anything parallel, but I am not sure it is actually allowed, because Packed_A is shared between all the threads.

As far as I know, parallelizing the outermost loop may be a good idea only on multi-socket systems, where each CPU has a separate L3 cache (please see [1]). Additionally, it would require replicating Packed_A and Packed_B to avoid a race condition under OpenMP parallelism.

I would suggest parallelizing the second loop around the micro-kernel by default. It would not violate the dependencies. In general, it can provide a good opportunity for parallelization (please see [1] and [2]). In particular, the reduction of the time spent in this loop may cancel out the cost of packing the elements of the created array Packed_A into the L2 cache.

If the L2 cache is not shared and the second loop is parallelized, then the elements of the created Packed_A are duplicated across the L2 caches. In this case, the first loop around the macro-kernel can be considered. If we parallelize this loop, then each thread will be assigned different elements of the matrix A, which reside in the L2 cache and are packed into the created array Packed_A, which may cause a race condition. So, replication of the elements of Packed_A would be required.
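For reference, here is a rough sketch of the Goto/BLIS-style loop nest being discussed, using this thread's loop numbering; the index names, blocking factors, and packing helpers are placeholders of mine, not Polly's generated code:

  // Sketch of blocked matrix multiplication C += A * B. Packed_A / Packed_B
  // are the packed buffers discussed above; loop_6 (the reduction over Kc)
  // lives inside micro_kernel, which updates an Mr x Nr block of C.
  enum { Mc = 256, Kc = 256, Nc = 4096, Mr = 4, Nr = 8 };  // assumed blocking factors
  extern void pack_B(int pc, int jc);  // copy a Kc x Nc panel of B into Packed_B
  extern void pack_A(int ic, int pc);  // copy an Mc x Kc block of A into Packed_A
  extern void micro_kernel(int ir, int jr, int pc);

  void blocked_matmul(int M, int N, int K) {
    for (int jc = 0; jc < N; jc += Nc)                         // loop_1 (outermost)
      for (int pc = 0; pc < K; pc += Kc) {                     // loop_2
        pack_B(pc, jc);
        for (int ic = 0; ic < M; ic += Mc) {                   // loop_3 (1st loop around the macro-kernel)
          pack_A(ic, pc);
          for (int jr = jc; jr < jc + Nc && jr < N; jr += Nr)    // loop_4 (2nd loop around the micro-kernel)
            for (int ir = ic; ir < ic + Mc && ir < M; ir += Mr)  // loop_5 (1st loop around the micro-kernel)
              micro_kernel(ir, jr, pc);
        }
      }
  }

Parallelizing loop_1 gives each thread its own column panels of B, so Packed_B (and, with this packing placement, Packed_A) would need to be replicated per thread; parallelizing loop_4 lets all threads read the same Packed_A, which is why the L2-cache sharing question above matters.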

Probably, the best option is to add an additional flag that specifies whether the L2 cache is shared. Depending on it, we could use different parallelization strategies. I think such a flag should be set to true by default.

Refs.:

[1] - Tyler M. Smith, Robert van de Geijn, Mikhail Smelyanskiy, Jeff R. Hammond, and Field G. Van Zee. 2014. Anatomy of high-performance many-threaded matrix multiplication. In Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS'14). IEEE Computer Society, Washington, DC, 1049–1059. DOI: http://dx.doi.org/10.1109/IPDPS.2014.110

[2] - Roman Gareev, Tobias Grosser, and Michael Kruse. 2018. High-Performance Generalized Tensor Operations: A Compiler-Oriented Approach. ACM Trans. Archit. Code Optim. 15, 3, Article 34 (August 2018), 27 pages. https://doi.org/10.1145/3235029

> I would suggest parallelizing the second loop around the micro-kernel by default. It would not violate the dependencies. In general, it can provide a good opportunity for parallelization (please see [1] and [2]). In particular, the reduction of the time spent in this loop may cancel out the cost of packing the elements of the created array Packed_A into the L2 cache.

I fear that $loop_4$ does not have enough work to justify the parallelization overhead. Also, there will be false sharing between cache lines. It could be reduced by having the #pragma omp parallel outside the matrix multiplication, and only #pragma omp for on $loop_4$. However, Polly does not support that yet.

The usual candidate for coarse-grain parallelization is always the outermost one, unless we want to exploit a shared cache, but that would be optional. We'd divide the packed array size equally between threads.

>> I would suggest parallelizing the second loop around the micro-kernel by default. It would not violate the dependencies. In general, it can provide a good opportunity for parallelization (please see [1] and [2]). In particular, the reduction of the time spent in this loop may cancel out the cost of packing the elements of the created array Packed_A into the L2 cache.

> I fear that $loop_4$ does not have enough work to justify the parallelization overhead.

Such a strategy is used in many hand-tuned libraries that contain implementations of matrix multiplication. In the general case, it helps to obtain good results.

> Also, there will be false sharing between cache lines. It could be reduced by having the #pragma omp parallel outside the matrix multiplication, and only #pragma omp for on $loop_4$. However, Polly does not support that yet.

Could you elaborate on why there will be false sharing between cache lines?

> The usual candidate for coarse-grain parallelization is always the outermost one, unless we want to exploit a shared cache, but that would be optional. We'd divide the packed array size equally between threads.

Goto's algorithm for matrix multiplication, which is implemented in BLIS, OpenBLAS, and, in particular, Polly, is based on effective usage of the cache hierarchy. So, I would propose eventually exploiting the knowledge about the data usage pattern. In my experience, the parallelization of loop $loop_1$ produces worse performance than the parallelization of loop $loop_4$ in many cases.

>>> I would suggest parallelizing the second loop around the micro-kernel by default. It would not violate the dependencies. In general, it can provide a good opportunity for parallelization (please see [1] and [2]). In particular, the reduction of the time spent in this loop may cancel out the cost of packing the elements of the created array Packed_A into the L2 cache.

>> I fear that $loop_4$ does not have enough work to justify the parallelization overhead.

> Such a strategy is used in many hand-tuned libraries that contain implementations of matrix multiplication. In the general case, it helps to obtain good results.

Do you have measurements?

>> Also, there will be false sharing between cache lines. It could be reduced by having the #pragma omp parallel outside the matrix multiplication, and only #pragma omp for on $loop_4$. However, Polly does not support that yet.

> Could you elaborate on why there will be false sharing between cache lines?

The innermost kernel:
C[..][j_c] += ...

When parallelizing the j_c loop, C[..][j_c + N_r - 1] and C[..][j_c + N_r] might be in the same cache line but processed by different threads. To avoid this, we'd need to ensure that N_r is larger than the cache line size of all cache levels and also align C to that size.
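As an illustrative calculation (the concrete numbers are assumptions): with 64-byte cache lines and a double-precision C (8 bytes per element, rows contiguous), one cache line holds 8 consecutive elements of a row of C. If N_r = 4, the panel C[..][j_c .. j_c+3] written by one thread and the neighbouring panel C[..][j_c+4 .. j_c+7] written by another thread share a cache line, so each write invalidates that line in the other core's cache. With N_r a multiple of 8 and the rows of C aligned to 64 bytes, every cache line of C is written by exactly one thread.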

>> The usual candidate for coarse-grain parallelization is always the outermost one, unless we want to exploit a shared cache, but that would be optional. We'd divide the packed array size equally between threads.

> Goto's algorithm for matrix multiplication, which is implemented in BLIS, OpenBLAS, and, in particular, Polly, is based on effective usage of the cache hierarchy. So, I would propose eventually exploiting the knowledge about the data usage pattern. In my experience, the parallelization of loop $loop_1$ produces worse performance than the parallelization of loop $loop_4$ in many cases.

Do you have measurements?

Polly inserts #pragma omp parallel for at the parallelized loop, which introduces thread fork-and-join overhead. It would be cheaper to have just one #pragma omp parallel at the outermost level (to spawn the threads once) and only #pragma omp for for $loop_4$. Polly currently does not support this.
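A minimal sketch of the two code shapes (not Polly's actual output; the loop structure, blocking parameters, and micro_kernel are placeholders):

  // Sketch only; micro_kernel and the blocking factors M_C / N_R are assumed names.
  void micro_kernel(int i, int j);

  // Shape 1: what Polly emits today. A thread team is forked and joined for
  // every execution of the parallelized loop_4.
  void fork_per_instance(int M, int N, int M_C, int N_R) {
    for (int i = 0; i < M; i += M_C) {
      #pragma omp parallel for
      for (int j = 0; j < N; j += N_R)
        micro_kernel(i, j);
    }
  }

  // Shape 2: fork the team once and only work-share loop_4 inside the region.
  // This amortizes the fork/join cost; Polly cannot generate this form yet.
  void single_parallel_region(int M, int N, int M_C, int N_R) {
    #pragma omp parallel
    {
      for (int i = 0; i < M; i += M_C) {
        #pragma omp for
        for (int j = 0; j < N; j += N_R)
          micro_kernel(i, j);
      }
    }
  }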

@bollu Do you intend to update this patch?

Meinersbur resigned from this revision. Jun 29 2022, 3:25 PM

DependenceInfo is now abandoned after the matmul optimization (rG6fa65f8a98967a5d2d2a6863e0f67a40d2961905).