This is an archive of the discontinued LLVM Phabricator instance.

[Polly] Disable matmul pattern-match + -polly-parallel
Needs Revision · Public

Authored by bollu on May 8 2022, 8:42 PM.

Details

Summary

The matmul pattern match and -polly-parallel interact poorly and lead to miscompiles
(see: https://groups.google.com/g/polly-dev/c/gYyPbg5yWOs/m/dPyhM_w-AwAJ).
We therefore disable the matrix-multiplication pattern match when -polly-parallel
is supplied.
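A minimal sketch of the kind of gating the summary describes; the option variable names below are assumptions for illustration, not necessarily the patch's actual code:

  #include "llvm/Support/CommandLine.h"

  // Sketch only, not the actual patch. The real cl::opt definitions live in
  // Polly's sources; the variable names here are assumed.
  static llvm::cl::opt<bool> PollyParallel(
      "polly-parallel", llvm::cl::desc("Generate thread-parallel code"),
      llvm::cl::init(false));
  static llvm::cl::opt<bool> PMBasedOpts(
      "polly-pattern-matching-based-opts",
      llvm::cl::desc("Perform optimizations based on pattern matching"),
      llvm::cl::init(true));

  // Only try the matmul pattern match when -polly-parallel is not in effect,
  // since the combination is known to miscompile (see the polly-dev thread).
  static bool shouldTryMatMulPatternMatch() {
    return PMBasedOpts && !PollyParallel;
  }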

Diff Detail

Event Timeline

bollu created this revision. May 8 2022, 8:42 PM
Herald added a project: Restricted Project. · View Herald Transcript May 8 2022, 8:42 PM
Herald added a subscriber: arjunp. · View Herald Transcript
bollu requested review of this revision. May 8 2022, 8:42 PM
Herald added a project: Restricted Project. · View Herald Transcript May 8 2022, 8:42 PM
grosser accepted this revision. May 8 2022, 8:55 PM

This looks good from my side. However, maybe Michael Kruse has further remarks?

This revision is now accepted and ready to land. May 8 2022, 8:55 PM
Meinersbur requested changes to this revision. May 9 2022, 7:10 PM

This fixes the problem:

diff --git a/polly/lib/Transform/ScheduleOptimizer.cpp b/polly/lib/Transform/ScheduleOptimizer.cpp
index fad62d9b20830..f2cf06af2f7c8 100644
--- a/polly/lib/Transform/ScheduleOptimizer.cpp
+++ b/polly/lib/Transform/ScheduleOptimizer.cpp
@@ -978,7 +978,7 @@ runIslScheduleOptimizerUsingNPM(Scop &S, ScopAnalysisManager &SAM,
   OptimizationRemarkEmitter ORE(&S.getFunction());
   TargetTransformInfo *TTI = &SAR.TTI;
   isl::schedule LastSchedule;
-  bool Modified = runIslScheduleOptimizer(S, GetDeps, TTI, &ORE, LastSchedule);
+  runIslScheduleOptimizer(S, GetDeps, TTI, &ORE, LastSchedule);
   if (OS) {
     *OS << "Printing analysis 'Polly - Optimize schedule of SCoP' for region: '"
         << S.getName() << "' in function '" << S.getFunction().getName()
@@ -986,13 +986,11 @@ runIslScheduleOptimizerUsingNPM(Scop &S, ScopAnalysisManager &SAM,
     runScheduleOptimizerPrinter(*OS, LastSchedule);
   }

-  if (!Modified)
-    return PreservedAnalyses::all();
-
   PreservedAnalyses PA;
   PA.preserveSet<AllAnalysesOn<Module>>();
   PA.preserveSet<AllAnalysesOn<Function>>();
   PA.preserveSet<AllAnalysesOn<Loop>>();
+  PA.abandon<DependenceAnalysis>();
   return PA;
 }

The optimization adds new statements without updating (or invalidating) DependenceInfo. Unfortunately, this now means that DependenceAnalysis will run a second time for AstInfo. Also consider adding a comment on why DependenceAnalysis is abandoned here and in the legacy pass manager. Maybe you can find a way to only invalidate it when the matmul optimization has actually been applied.
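A minimal sketch of that last suggestion, assuming runIslScheduleOptimizer could be extended with an out-parameter (called DepsChanged here purely for illustration) that reports whether the matmul transformation rewrote the SCoP:

  // Sketch only: abandon DependenceAnalysis just when the matmul optimization
  // actually added new statements, and keep all analyses otherwise.
  bool DepsChanged = false;
  runIslScheduleOptimizer(S, GetDeps, TTI, &ORE, LastSchedule, DepsChanged);
  ...
  PreservedAnalyses PA;
  PA.preserveSet<AllAnalysesOn<Module>>();
  PA.preserveSet<AllAnalysesOn<Function>>();
  PA.preserveSet<AllAnalysesOn<Loop>>();
  if (DepsChanged)
    PA.abandon<DependenceAnalysis>();
  return PA;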

We should also discuss the following:

diff --git a/polly/lib/Transform/MatmulOptimizer.cpp b/polly/lib/Transform/MatmulOptimizer.cpp
index 60dd9eda3c2c0..c46025522bc23 100644
--- a/polly/lib/Transform/MatmulOptimizer.cpp
+++ b/polly/lib/Transform/MatmulOptimizer.cpp
@@ -491,9 +491,6 @@ createMacroKernel(isl::schedule_node Node,
   Node = permuteBandNodeDimensions(Node, DimOutNum - 2, DimOutNum - 1);
   Node = permuteBandNodeDimensions(Node, DimOutNum - 3, DimOutNum - 1);

-  // Mark the outermost loop as parallelizable.
-  Node = Node.as<isl::schedule_node_band>().member_set_coincident(0, true);
-
   return Node.child(0).child(0);
 }

I added this in aa8a976174c7ac08676bbc7bb647f6bc0efd2e72 and I think it does not actually make anything parallel, but I am not sure it is actually allowed, because Packed_A is shared between all the threads.

This revision now requires changes to proceed. May 9 2022, 7:11 PM
grosser requested changes to this revision. May 9 2022, 9:29 PM

This is excellent. Thank you Michael!

gareevroman added a subscriber: gareevroman. Edited May 15 2022, 11:16 PM

> I added this in aa8a976174c7ac08676bbc7bb647f6bc0efd2e72 and I think it does not actually make anything parallel, but I am not sure it is actually allowed, because Packed_A is shared between all the threads.

As far as I know, parallelizing the outermost loop may be a good idea only on multi-socket systems, where each CPU has a separate L3 cache (please see [1]). Additionally, it would require replicating Packed_A and Packed_B to avoid a race condition under OpenMP parallelism.

I would suggest parallelizing the second loop around the micro-kernel by default. It would not violate the dependencies. In general, it can provide a good opportunity for parallelization (please see [1] and [2]). In particular, the reduction of the time spent in this loop may cancel out the cost of packing the elements of the created array Packed_A into the L2 cache.

If the L2 cache is not shared and the second loop is parallelized, then the elements of the created Packed_A are duplicated across the L2 caches. In this case, the first loop around the macro-kernel can be considered. If we parallelize this loop, then each thread will be assigned different elements of the matrix A, which reside in the L2 cache and are packed into the created array Packed_A, which may cause a race condition. So, replication of the elements of Packed_A would be required.
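For reference, here is a rough sketch of the Goto/BLIS-style loop nest being discussed, using this thread's loop numbering; the index names, blocking factors, and packing helpers are placeholders of mine, not Polly's generated code:

  // Sketch of blocked matrix multiplication C += A * B. Packed_A / Packed_B
  // are the packed buffers discussed above; loop_6 (the reduction over Kc)
  // lives inside micro_kernel, which updates an Mr x Nr block of C.
  enum { Mc = 256, Kc = 256, Nc = 4096, Mr = 4, Nr = 8 };  // assumed blocking factors
  extern void pack_B(int pc, int jc);  // copy a Kc x Nc panel of B into Packed_B
  extern void pack_A(int ic, int pc);  // copy an Mc x Kc block of A into Packed_A
  extern void micro_kernel(int ir, int jr, int pc);

  void blocked_matmul(int M, int N, int K) {
    for (int jc = 0; jc < N; jc += Nc)                         // loop_1 (outermost)
      for (int pc = 0; pc < K; pc += Kc) {                     // loop_2
        pack_B(pc, jc);
        for (int ic = 0; ic < M; ic += Mc) {                   // loop_3 (1st loop around the macro-kernel)
          pack_A(ic, pc);
          for (int jr = jc; jr < jc + Nc && jr < N; jr += Nr)    // loop_4 (2nd loop around the micro-kernel)
            for (int ir = ic; ir < ic + Mc && ir < M; ir += Mr)  // loop_5 (1st loop around the micro-kernel)
              micro_kernel(ir, jr, pc);
        }
      }
  }

Parallelizing loop_1 gives each thread its own column panels of B, so Packed_B (and, with this packing placement, Packed_A) would need to be replicated per thread; parallelizing loop_4 lets all threads read the same Packed_A, which is why the L2-cache sharing question above matters.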

Probably, the best option is to add an additional flag that specifies whether the L2 cache is shared. Depending on it, we could use different parallelization strategies. I think such a flag should be set to true by default.

Refs.:

[1] - Tyler M. Smith, Robert van de Geijn, Mikhail Smelyanskiy, Jeff R. Hammond, and Field G. Van Zee. 2014. Anatomy of high-performance many-threaded matrix multiplication. In Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS'14). IEEE Computer Society, Washington, DC, 1049–1059. DOI: http://dx.doi.org/10.1109/IPDPS.2014.110

[2] - Roman Gareev, Tobias Grosser, and Michael Kruse. 2018. High-Performance Generalized Tensor Operations: A Compiler-Oriented Approach. ACM Trans. Archit. Code Optim. 15, 3, Article 34 (August 2018), 27 pages. https://doi.org/10.1145/3235029

> I would suggest parallelizing the second loop around the micro-kernel by default. It would not violate the dependencies. In general, it can provide a good opportunity for parallelization (please see [1] and [2]). In particular, the reduction of the time spent in this loop may cancel out the cost of packing the elements of the created array Packed_A into the L2 cache.

I fear that $loop_4$ does not have enough work to justify the parallelization overhead. Also, there will be false sharing between cache lines. It could be reduced by having the #pragma omp parallel outside the matrix multiplication, and only #pragma omp for on $loop_4$. However, Polly does not support that yet.

The usual candidate for coarse-grain parallelization is always the outermost one, unless we want to exploit a shared cache, but that would be optional. We'd divide the packed array size equally between threads.

>> I would suggest parallelizing the second loop around the micro-kernel by default. It would not violate the dependencies. In general, it can provide a good opportunity for parallelization (please see [1] and [2]). In particular, the reduction of the time spent in this loop may cancel out the cost of packing the elements of the created array Packed_A into the L2 cache.

> I fear that $loop_4$ does not have enough work to justify the parallelization overhead.

Such a strategy is used in many hand-tuned libraries that contain implementations of matrix multiplication. In the general case, it helps to obtain good results.

> Also, there will be false sharing between cache lines. It could be reduced by having the #pragma omp parallel outside the matrix multiplication, and only #pragma omp for on $loop_4$. However, Polly does not support that yet.

Could you elaborate on why there will be false sharing between cache lines?

> The usual candidate for coarse-grain parallelization is always the outermost one, unless we want to exploit a shared cache, but that would be optional. We'd divide the packed array size equally between threads.

Goto's algorithm for matrix multiplication, which is implemented in BLIS, OpenBLAS, and, in particular, Polly, is based on effective usage of the cache hierarchy. So, I would propose eventually exploiting the knowledge about the data usage pattern. In my experience, the parallelization of loop $loop_1$ produces worse performance than the parallelization of loop $loop_4$ in many cases.

>>> I would suggest parallelizing the second loop around the micro-kernel by default. It would not violate the dependencies. In general, it can provide a good opportunity for parallelization (please see [1] and [2]). In particular, the reduction of the time spent in this loop may cancel out the cost of packing the elements of the created array Packed_A into the L2 cache.

>> I fear that $loop_4$ does not have enough work to justify the parallelization overhead.

> Such a strategy is used in many hand-tuned libraries that contain implementations of matrix multiplication. In the general case, it helps to obtain good results.

Do you have measurements?

>> Also, there will be false sharing between cache lines. It could be reduced by having the #pragma omp parallel outside the matrix multiplication, and only #pragma omp for on $loop_4$. However, Polly does not support that yet.

> Could you elaborate on why there will be false sharing between cache lines?

The innermost kernel:
C[..][j_c] += ...

When parallelizing the j_c loop, C[..][j_c + N_r - 1] and C[..][j_c + N_r] might be in the same cache line but processed by different threads. To avoid this, we'd need to ensure that N_r is larger than the cache line size of all cache levels and also align C to that size.
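As an illustrative calculation (the concrete numbers are assumptions): with 64-byte cache lines and a double-precision C (8 bytes per element, rows contiguous), one cache line holds 8 consecutive elements of a row of C. If N_r = 4, the panel C[..][j_c .. j_c+3] written by one thread and the neighbouring panel C[..][j_c+4 .. j_c+7] written by another thread share a cache line, so each write invalidates that line in the other core's cache. With N_r a multiple of 8 and the rows of C aligned to 64 bytes, every cache line of C is written by exactly one thread.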

>> The usual candidate for coarse-grain parallelization is always the outermost one, unless we want to exploit a shared cache, but that would be optional. We'd divide the packed array size equally between threads.

> Goto's algorithm for matrix multiplication, which is implemented in BLIS, OpenBLAS, and, in particular, Polly, is based on effective usage of the cache hierarchy. So, I would propose eventually exploiting the knowledge about the data usage pattern. In my experience, the parallelization of loop $loop_1$ produces worse performance than the parallelization of loop $loop_4$ in many cases.

Do you have measurements?

Polly inserts #pragma omp parallel for at the parallelized loop, which introduces thread fork-and-join overhead. It would be cheaper to have just one #pragma omp parallel at the outermost level (to spawn the threads once) and only #pragma omp for for $loop_4$. Polly currently does not support this.
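A minimal sketch of the two code shapes (not Polly's actual output; the loop structure, blocking parameters, and micro_kernel are placeholders):

  // Sketch only; micro_kernel and the blocking factors M_C / N_R are assumed names.
  void micro_kernel(int i, int j);

  // Shape 1: what Polly emits today. A thread team is forked and joined for
  // every execution of the parallelized loop_4.
  void fork_per_instance(int M, int N, int M_C, int N_R) {
    for (int i = 0; i < M; i += M_C) {
      #pragma omp parallel for
      for (int j = 0; j < N; j += N_R)
        micro_kernel(i, j);
    }
  }

  // Shape 2: fork the team once and only work-share loop_4 inside the region.
  // This amortizes the fork/join cost; Polly cannot generate this form yet.
  void single_parallel_region(int M, int N, int M_C, int N_R) {
    #pragma omp parallel
    {
      for (int i = 0; i < M; i += M_C) {
        #pragma omp for
        for (int j = 0; j < N; j += N_R)
          micro_kernel(i, j);
      }
    }
  }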

@bollu Do you intend to update this patch?

Meinersbur resigned from this revision. Jun 29 2022, 3:25 PM

DependenceInfo is now abandoned after the matmul optimization (rG6fa65f8a98967a5d2d2a6863e0f67a40d2961905).