
[Polly] Isolate a set of partial tile prefixes to allow hoisting and sinking out of the unrolled innermost loops produced by the optimization of the matrix multiplication.
ClosedPublic

Authored by gareevroman on Jan 28 2017, 12:04 AM.

Details

Summary

If it cannot be proved that the number of loop iterations is evenly divisible by the tile sizes, and we tile and unroll the point loop, isl generates conditional expressions. These conditional expressions can prevent the stores and loads of the unrolled loops from being sunk and hoisted.

The patch isolates a set of partial tile prefixes that contain exactly Mr x Nr iterations of the two innermost loops produced by the loop tiling of the matrix multiplication optimization, where Mr and Nr are parameters of the micro-kernel. This helps to get rid of the conditional expressions in the unrolled innermost loops. This approach could possibly be replaced with padding in the future.
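For illustration, here is a minimal Python sketch of the underlying idea of isolating full-tile prefixes (this is a schematic analogue, not the actual isl-based implementation; the function name and parameters are hypothetical):

```python
def split_tiles(n, tile):
    """Split range(n) into full tiles of exactly `tile` iterations
    and a trailing partial tile (None when `tile` divides n evenly).
    Full tiles need no bound checks, so they can be unrolled and
    vectorized without per-iteration guards."""
    remainder = n % tile
    full = [(t, t + tile) for t in range(0, n - remainder, tile)]
    partial = (n - remainder, n) if remainder else None
    return full, partial

# With n = 10 and tile = 4: two full tiles and one partial tile.
full, partial = split_tiles(10, 4)
# full    -> [(0, 4), (4, 8)]
# partial -> (8, 10)
```

Only the partial tile, if any, retains the conditional expressions; the isolated full tiles can be unrolled cleanly.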

For example, for the gemm from Polybench/C 3.2 with parametric loop bounds, it increases the performance from 7.98 GFlops (27.71% of theoretical peak) to 21.47 GFlops (74.57% of theoretical peak). Hence, we get the same performance as with scalar loop bounds.

It also causes a compile-time regression: the compile time increases from 0.795 seconds to 0.837 seconds with scalar loop bounds and from 1.222 seconds to 1.490 seconds with parametric loop bounds.

Diff Detail

Repository
rL LLVM

Event Timeline

gareevroman created this revision.Jan 28 2017, 12:04 AM

This patch also makes ScheduleTreeOptimizer::optimizeBand return a schedule node optimized with optimizeMatMulPattern. Otherwise, standardBandOpts can try to tile a band node with an anchored subtree. Furthermore, it seems that it's not correct to apply the standard optimizations when the matrix multiplication has been detected. Should we commit this in a separate patch?

Meinersbur edited edge metadata.Jan 31 2017, 6:29 AM

Do you maybe have a performance comparison? Does it hurt performance when the gemm is not parametric?

lib/Transform/ScheduleOptimizer.cpp
248 ↗(On Diff #86169)

I assume && "..." is a placeholder.

273 ↗(On Diff #86169)

One empty comment line too many.

277 ↗(On Diff #86169)

__isl_take: isl_ctx parameters are never annotated, they don't use reference counting.

1135 ↗(On Diff #86169)

Who will vectorize it? Is it -polly-vectorizer=polly or do you have LLVM's loop vectorizer in mind?

1231 ↗(On Diff #86169)

Is this related?

test/ScheduleOptimizer/pattern-matching-based-opts_5.ll
16 ↗(On Diff #86169)

Can you add a description what this test was written for?

17–94 ↗(On Diff #86169)

Since you wrote the partial unrolling for the vectorizer, would it make sense to (also?) test whether vector instructions appear in the generated code. You can pass -loop-vectorize to opt so we see that the vectorization actually worked.

gareevroman edited the summary of this revision. (Show Details)

Hi Michael,

thanks for the comments! I've updated the patch according to them.

Do you maybe have a performance comparison? Does it hurt performance when the gemm is not parametric?

I have a performance comparison for the gemm from Polybench/C 3.2. I've added it to the commit message.

Since you wrote the partial unrolling for the vectorizer, would it make sense to (also?) test whether vector instructions appear in the generated code. You can pass -loop-vectorize to opt so we see that the vectorization actually worked.

I think that it makes sense. I've updated the test case.

However, I'm not sure what the best way to do it is. It seems to require checking the generated code. Consequently, there are a lot of suffix digits in the checks, which can change easily with otherwise unrelated changes, including changes to LLVM itself. What do you think about it?

Is this related?

As far as I understand, it's related. However, probably it makes sense to commit it in a separate patch. What do you think about it?

This patch also makes ScheduleTreeOptimizer::optimizeBand return a schedule node optimized with optimizeMatMulPattern. Otherwise, standardBandOpts can try to tile a band node with an anchored subtree and fail with an error, because the use of the isolate option causes any tree containing the node to be considered anchored. Furthermore, it seems that it's not correct to apply the standard optimizations when the matrix multiplication has been detected.

Do you maybe have a performance comparison? Does it hurt performance when the gemm is not parametric?

I have a performance comparison for the gemm from Polybench/C 3.2. I've added it to the commit message.

Thanks for the nice speedup.

With "hurt performance when the gemm is not parametric" I was thinking that unrolling non-parametric loops could be unnecessary, as the loop vectorizer would handle them already. Instead, (partial?) unrolling could add an additional burden such as code-size blowup (or additional control flow).

I could also be wrong in that the isl scheduler will not unroll non-parametric loops, but I don't see this from this code.

Since you wrote the partial unrolling for the vectorizer, would it make sense to (also?) test whether vector instructions appear in the generated code. You can pass -loop-vectorize to opt so we see that the vectorization actually worked.

I think that it makes sense. I've updated the test case.

However, I'm not sure what the best way to do it is. It seems to require checking the generated code. Consequently, there are a lot of suffix digits in the checks, which can change easily with otherwise unrelated changes, including changes to LLVM itself. What do you think about it?

I would not check the entire generated code. That makes the test very sensitive to even minor changes in LLVM passes. What you could do is:

  • Check the existence of vector instructions such as:
AUTO-VECTORIZATION-NEXT: fmul <4 x double>
  • Check the statistics count for the correct number of vector instructions. The SLP vectorizer has a statistic NumVectorInstructions. See e.g. llvm/test/Transforms/LoopVectorize/vect.stats.ll on how to test by statistic.

Is this related?

As far as I understand, it's related. However, probably it makes sense to commit it in a separate patch. What do you think about it?

This patch also makes ScheduleTreeOptimizer::optimizeBand return a schedule node optimized with optimizeMatMulPattern. Otherwise, standardBandOpts can try to tile a band node with an anchored subtree and fail with an error, because the use of the isolate option causes any tree containing the node to be considered anchored. Furthermore, it seems that it's not correct to apply the standard optimizations when the matrix multiplication has been detected.

I see. The idea that standard optimization should not be applied when specialized optimization already have been applied looks like an independent change to me (although it only gives an error with this patch). Can you commit that separately?

gareevroman updated this revision to Diff 87642.Feb 8 2017, 6:15 AM
gareevroman retitled this revision from [Polly] Isolate a set of partial tile prefixes to auto-vectorize the unrolled innermost loops produced by the optimization of the matrix multiplication in case of parametric bounds. to [Polly] Isolate a set of partial tile prefixes to allow hoisting and sinking out of the unrolled innermost loops produced by the optimization of the matrix multiplication..
gareevroman edited the summary of this revision. (Show Details)

With "hurt performance when the gemm is not parametric" I was thinking that unrolling non-parametric loops could be unnecessary, as the loop vectorizer would handle them already. Instead, (partial?) unrolling could add an additional burden such as code-size blowup (or additional control flow).

If I understand you correctly, you advise to use the loop-vectorizer instead of the SLP-vectorizer to avoid unrolling. Right?

Maybe it's better. However, it would possibly require solving a few issues.

As far as I understand, the loop vectorizer can only vectorize the innermost loop. Consequently, it would probably be necessary to sink and hoist accesses to the matrix C out of the two innermost loops (when we unroll the two innermost loops, we sink and hoist accesses out of only the single innermost loop).

We should probably generate a pragma that specifies that the innermost loop should be vectorized by the loop vectorizer. Otherwise, we would possibly depend on its cost model, which can decide that it is not profitable to vectorize the loop (as far as I understand, we have this situation in case of -polly-vectorizer=stripmine).

I could also be wrong in that the isl scheduler will not unroll non-parametric loop, but I don't see this from this code.

Right. My understanding is that the isl scheduler can unroll both parametric and non-parametric loops.

Sorry for the confusion. The problem is that if it cannot be proved that the number of loop iterations is evenly divisible by the tile sizes, and we tile and unroll the point loop, isl generates conditional expressions. These conditional expressions can prevent the stores and loads of the unrolled loops from being sunk and hoisted. I should have specified that in this case the loops can still be vectorized, but this causes a run-time regression compared to the vectorized code with sunk and hoisted memory accesses. To clarify this, I've updated the title and the commit message. I've also added a test case in which the number of loop iterations cannot be evenly divided by the tile sizes.
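To illustrate why the guards block hoisting and sinking, here is a schematic Python analogue (hypothetical function names and signatures; the real code is generated LLVM IR, not Python): in the guarded variant every access to c is conditional, so it cannot be treated as loop-invariant, whereas the isolated full tile runs exactly mr unguarded iterations.

```python
def partial_tile_kernel(c, a, b, base, n, mr):
    """Unrolled body with per-iteration guards (partial tile): each
    access to c is conditional on i < n, which prevents the load and
    store of c[i] from being hoisted/sunk out of the unrolled body."""
    for u in range(mr):      # unrolled in the generated code
        i = base + u
        if i < n:            # guard emitted by isl for partial tiles
            c[i] = c[i] + a[i] * b[i]

def full_tile_kernel(c, a, b, base, mr):
    """Isolated full tile: exactly mr iterations and no guards, so the
    accesses to c can be hoisted and sunk by LICM."""
    for u in range(mr):
        i = base + u
        c[i] = c[i] + a[i] * b[i]
```

On a full tile both variants compute the same result; isolation simply guarantees that the guard-free form is legal for the isolated prefixes.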

I would not check the entire generated code. That makes the test very sensitive to even minor changes in LLVM passes. What you could do is:

Check the existence of vector instructions such as:
AUTO-VECTORIZATION-NEXT: fmul <4 x double>
Check the statistics count for the correct number of vector instructions. The SLP vectorizer has a statistic NumVectorInstructions. See e.g. llvm/test/Transforms/LoopVectorize/vect.stats.ll on how to test by statistic.

Thanks for the advice. I've also added checks of the statistics counts for licm, because we should probably also check that the stores and loads of the unrolled loops are sunk and hoisted.

Meinersbur accepted this revision.Feb 8 2017, 6:35 AM

If I understand you correctly, you advise to use the loop-vectorizer instead of the SLP-vectorizer to avoid unrolling. Right?

The previous description made me think there are two cases:

  1. loops with constant bounds
  2. loops with parametric bounds

and I thought you aimed for case 2) only (handled by the SLP vectorizer), while case 1) would be left to the loop vectorizer. I did not see this distinction in the code and assumed that case 1) would accidentally be handled as well. Your new description is much better. Thanks.

A shorter title for the commit would be nice. The title now can be the first line of the message.

test/ScheduleOptimizer/pattern-matching-based-opts_6.ll
102–108 ↗(On Diff #87642)

nice

103 ↗(On Diff #87642)

I think there is no requirement for it to be the following instruction, so I suggest dropping the -NEXT (sorry if it seemed that I suggested it).

This revision is now accepted and ready to land.Feb 8 2017, 6:35 AM
This revision was automatically updated to reflect the committed changes.