Page MenuHomePhabricator

[Polly][MatMul][WIP] Disable the Loop Vectorizer
ClosedPublic

Authored by gareevroman on Aug 19 2017, 10:57 AM.

Details

Summary

The Loop Vectorizer can generate mis-optimized code https://bugs.llvm.org/show_bug.cgi?id=34245

In case of GEMM, we use only the SLP Vectorizer out of two LLVM vectorizers. Consequently, we disable the Loop Vectorizer for the innermost loop using mark nodes and emitting the corresponding metadata.

P.S.: I haven't managed to insert the mark nodes before AST for nodes, since the isolation can produce if statements and, AFAIU, we aren't able to modify their children. For example, we can get the following

// Mark node
if (…)
  for (…)

Consequently, we aren't able to modify the for loop during handling of the mark node.

Diff Detail

Repository
rL LLVM

Event Timeline

gareevroman created this revision.Aug 19 2017, 10:57 AM
grosser edited edge metadata.Aug 19 2017, 11:07 AM

LGTM, besides the broken comment.

lib/CodeGen/IslNodeBuilder.cpp
492 ↗(On Diff #111842)

This comment looks out of place?

grosser accepted this revision.Aug 19 2017, 11:07 AM
This revision is now accepted and ready to land.Aug 19 2017, 11:07 AM

I suggest that you wait on this until we understand the regression.

I suggest that you wait on this until we understand the regression.

In part, it's possible that this is the wrong fix. Based on the bug report, it is possible that the problem is that the vectorizer is generating runtime checks. Maybe aliasing metadata would help. Maybe it's unrolling too much.

I suggest that you wait on this until we understand the regression.

In part, it's possible that this is the wrong fix. Based on the bug report, it is possible that the problem is that the vectorizer is generating runtime checks. Maybe aliasing metadata would help. Maybe it's unrolling too much.

Currently, in case of GEMM detected and optimized by Polly, only the SLP vectorizer is needed out of two LLVM vectorizers. Usually the Loop Vectorizer doesn't affect the code. Consequently, as far as I understand, this fix shouldn't hurt anything.

To generate BLIS micro-kernel [1], we apply tiling and unroll the two innermost loops. Subsequently, we use LICM to sink and hoist all stores and loads form the innermost loop, and apply the SLP vectorizer to get a sequence of rank-1 updates.

I think that the application of the Loop Vectorizer instead of the SLP Vectorizer is the way to improve matrix optimization of Polly. However, as far as I understand, there is no way to make the Loop Vectorizer vectorize a loop and sink and hoist accesses. All available metadata are only optimization hints and the optimizer will only interleave and vectorize loops if it believes it's safe to do so. Consequently, it'd probably require the involvement of maintainers of the Loop Vectorizer.

P.S.: Af far as I know, in case we unroll only the innermost loop, the Loop Vectorizer does unroll and vectorize the second loop. However, it doesn't sink and hoist stores and loads. Also, I haven't tested how it vectorizes different forms of generalized matrix multiplication, for example, the shortest path problem which can use the following kernel:

#define MIN(X, Y) (((X) < (Y)) ? (X) : (Y))

  for (i = 0; i < _PB_NI; i++)
    for (j = 0; j < _PB_NI; j++)
      for (k = 0; k < _PB_NI; k++)
        L[i][j] = MIN(L[i][j], W[i][k] + W[k][j]);

Refs.:

[1] - https://pdfs.semanticscholar.org/cb77/c2fdf8132f5e88f09b253a9fe01b65da7bc4.pdf

Hi Hal,

I think this is conceptually the right approach. We currently generate code -- with explicit register unrolling -- and expect the SLP vectorizer to perform the vectorization. I believe communicating this information via explicit metadata is reasonable.

We may want to move towards using the LLVM loop vectorizer rather than the SLP vectorizer, but this requires both changes to the loop vectorizer and to our code generation strategy. We should certainly consider this, but I feel that this could be separate steps. 1) clarify current behavior and fix regressions, 2) expand the loop vectorizer, 3) change our code generation logic.

Roman, I think Hal is right that we should look into how to improve the loop vectorizer. It would be great if you could add more information to the bug report, such that others can -- independently of polly -- understand what optimization the current loop vectorizer does not do we be needed for us.

Hi Hal,

I think this is conceptually the right approach. We currently generate code -- with explicit register unrolling -- and expect the SLP vectorizer to perform the vectorization. I believe communicating this information via explicit metadata is reasonable.

We may want to move towards using the LLVM loop vectorizer rather than the SLP vectorizer, but this requires both changes to the loop vectorizer and to our code generation strategy. We should certainly consider this, but I feel that this could be separate steps. 1) clarify current behavior and fix regressions, 2) expand the loop vectorizer, 3) change our code generation logic.

Okay, I understand. In that case, this is fine. If you're explicitly setting up for the SLP vectorizer, then we don't want the loop vectorizer to get in the way. However, at least in theory, this is probably suboptimal and we should figure out how to make it better.

Roman, I think Hal is right that we should look into how to improve the loop vectorizer. It would be great if you could add more information to the bug report, such that others can -- independently of polly -- understand what optimization the current loop vectorizer does not do we be needed for us.

Yes, please do.

I suggest that you wait on this until we understand the regression.

In part, it's possible that this is the wrong fix. Based on the bug report, it is possible that the problem is that the vectorizer is generating runtime checks. Maybe aliasing metadata would help. Maybe it's unrolling too much.

Currently, in case of GEMM detected and optimized by Polly, only the SLP vectorizer is needed out of two LLVM vectorizers. Usually the Loop Vectorizer doesn't affect the code. Consequently, as far as I understand, this fix shouldn't hurt anything.

To generate BLIS micro-kernel [1], we apply tiling and unroll the two innermost loops. Subsequently, we use LICM to sink and hoist all stores and loads form the innermost loop, and apply the SLP vectorizer to get a sequence of rank-1 updates.

I think that the application of the Loop Vectorizer instead of the SLP Vectorizer is the way to improve matrix optimization of Polly. However, as far as I understand, there is no way to make the Loop Vectorizer vectorize a loop and sink and hoist accesses. All available metadata are only optimization hints and the optimizer will only interleave and vectorize loops if it believes it's safe to do so.

FYI: This is not completely true. You can add ‘llvm.mem.parallel_loop_access‘ metadata to cause the vectorizer to assume vectorization safety. Look at what Clang generates if you use #pragma clang loop vectorize(assume_safety) or #pragma omp simd for. Using this metadata may allow you to setup the loop for efficient loop vectorization.

Consequently, it'd probably require the involvement of maintainers of the Loop Vectorizer.

P.S.: Af far as I know, in case we unroll only the innermost loop, the Loop Vectorizer does unroll and vectorize the second loop. However, it doesn't sink and hoist stores and loads. Also, I haven't tested how it vectorizes different forms of generalized matrix multiplication, for example, the shortest path problem which can use the following kernel:

#define MIN(X, Y) (((X) < (Y)) ? (X) : (Y))

  for (i = 0; i < _PB_NI; i++)
    for (j = 0; j < _PB_NI; j++)
      for (k = 0; k < _PB_NI; k++)
        L[i][j] = MIN(L[i][j], W[i][k] + W[k][j]);

Refs.:

[1] - https://pdfs.semanticscholar.org/cb77/c2fdf8132f5e88f09b253a9fe01b65da7bc4.pdf

Hi Hal,

I think this is conceptually the right approach. We currently generate code -- with explicit register unrolling -- and expect the SLP vectorizer to perform the vectorization. I believe communicating this information via explicit metadata is reasonable.

We may want to move towards using the LLVM loop vectorizer rather than the SLP vectorizer, but this requires both changes to the loop vectorizer and to our code generation strategy. We should certainly consider this, but I feel that this could be separate steps. 1) clarify current behavior and fix regressions, 2) expand the loop vectorizer, 3) change our code generation logic.

Roman, I think Hal is right that we should look into how to improve the loop vectorizer.

Yes, I think this is a good idea.

It would be great if you could add more information to the bug report, such that others can -- independently of polly --

My understanding is that the bug report already contains the test case, which is independent of Polly. Sorry, I haven't managed to reduce it yet.

understand what optimization the current loop vectorizer does not do we be needed for us.

Probably, it should be described in a separate bug report. I'll try to do it soon.

This revision was automatically updated to reflect the committed changes.