This is an archive of the discontinued LLVM Phabricator instance.

[Polly] Generalize the pattern matching to the case of tensor contractions.
ClosedPublic

Authored by gareevroman on Nov 21 2021, 6:10 AM.

Details

Summary

The pattern matching optimization of Polly detects and optimizes dense general matrix-matrix multiplication. The generated code is close to high performance implementations of matrix-matrix multiplications, which are contained in manually tuned libraries [1]. The described pattern matching optimization is a particular case of tensor contraction optimization, which was introduced in [2].

This patch generalizes the pattern matching to the case of tensor contractions using the algorithm described in [2]. Following the ideas introduced in [3], it logically represents tensor contraction operands as matrix multiplication operands and uses the approach presented in [1].

Optimization of tensor contractions will be added in the next patch. These modifications can be found in https://github.com/gareevroman/llvm-project.

[1] - Low T.M., Igual F.D., Smith T.M., Quintana-Orti E.S. Analytical Modeling Is Enough for High-Performance BLIS // ACM Transactions on Mathemat­ical Software. 2016. Vol. 43, no. 2. P. 12:1—12:18. DOI: 10.1145/2925987.

[2] - Gareev R., Grosser T., Kruse M. High-Performance Generalized Tensor Op­erations: A Compiler-Oriented Approach // ACM Transactions on Architec­ture and Code Optimization (TACO). 2018. Vol. 15, no. 3. P. 34:1–34:27. DOI: 10.1145/3235029.

[3] - Matthews D. High-Performance Tensor Contraction without BLAS // SIAM Journal on Scientific Computing. 2018. Vol. 40, no. 1. P. C 1—C 24. DOI: 110.1137/16m108968x.

Diff Detail

Event Timeline

gareevroman created this revision.Nov 21 2021, 6:10 AM
gareevroman requested review of this revision.Nov 21 2021, 6:10 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 21 2021, 6:10 AM

Thanks for upstreaming your tensor optimization!

The naming of polly-pattern-matching-based-opts suggests that it includes all pattern-based optimizations, yet this introduces another flag polly-pattern-matching-based-tc-opts. I'd prefer polly-pattern-matching-based-opts controlling both optimizations, and then additional flags for enabling matrix-multiplication and tensor optimizations. Alternatively, rename polly-pattern-matching-based-opts to e.g. polly-matmul-opt. Also, you are adding this functionality into a file called MatmulOptimizer and a function called tryOptimizeMatMulPattern. Since matrix-multiplication is a tensor contraction, is ts-opt supposed to supperseed matrix multiplication? In short, I would like to know what what the relation between those two optimizations should be. I'd prefer to not have to maintain two optimizations if one is strictly more powerful than the other.

I'd enjoy some occasional comments and grouping of statements (empty lines) inside the functions in addition to the doxygen comments. For instance isTCOperandAcc is just a wall of code. For such property-checking functions, ideally each return should mention what property is violated here and why this property is required to be compatible with tensor optimization.

polly/lib/Transform/MatmulOptimizer.cpp
180–181

Please add more details on what the members represent.

187–189

std::set is a high-overhead implementation. Consider using DenseSet or SmallDenseSet. See https://www.llvm.org/docs/ProgrammersManual.html#llvm-adt-denseset-h

190–196

Is there an argument to use 30 and small size? If not, consider using just SmallVector<int>.

1091

[style] No reason to make this a wide string literal, especially if just used as an assertion failed message.

Apples to other occurrences as well.

1110
1116

or introduce intFromIslSize.

1149–1150

SmallVectorImpl is not specific to what the vector's small size is.

1160

Consider using polly::rangeIslSize for iterating over dimensions.

1187

Although already in an anon namespace, the other methods add static as well. I found it helps the compiler to warn if a static function is unnused.

1189–1190

Do you know of #include <llvm/ADT/SetOperations.h>? Unfortunately, these modify one set rather than returning a new set.

1219
1229–1230

This computes whether two sets a disjoint, it should not be required to compute the intersection.

1246
1284

getAccessesInOrder requires Stmt to not be a RegionStmt. Please add a test for it.

1294

If any of the returns are executed, what causes the pattern to be rejected (it's not return false)?

1363–1364
1365–1367

The check should not depend on n_basic_set, which is fragile and depends on whether on eg. coalesce was successful. Consider using something like polly::getConstant.

1411–1412

Consider lexmin_pw_multi_aff/lexmax_pw_multi_aff

polly/lib/Transform/ScheduleOptimizer.cpp
459–463

Instead of modifying the idea of whether a node is tilable, consider introducing another constraint-checking function, as we should have done with prevectorization as well.

Thank you very much for the review!

I've tried to address all comments. Additionally, I've updated the optimization for the case of filter nodes (e.g., polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm.ll). I believe that the optimization of tensor contractions is strictly more powerful than the optimization of matrix-multiplications. So, I'd suggest to replace the optimization of matrix-multiplications with it eventually.

gareevroman marked 19 inline comments as done.Dec 12 2021, 1:30 AM
gareevroman added inline comments.
polly/lib/Transform/MatmulOptimizer.cpp
1116

As far I understand, Dimensions.size() returns a value of type size_t instead of a value of the type isl_size. So, in the new version I used the unsigned type to avoid the cast.

1229–1230

That check is redundant. Thanks.

1246

As far as I understand, we cannot do this here because of the assignment to TCI.ADimensions and TCI.BDimensions

1284

I’ve added a check to containsOnlyTCAcc. Could you clarify how the test case should look like? Should it be a region statement that contains a matrix multiplication with right order of memory accesses?

1365–1367

I think that this check was redundant. I’ve removed it.

gareevroman marked 5 inline comments as done.Dec 12 2021, 1:31 AM

I am extremely sorry for the late review, I hope you could use the time to work on/polish the follow-up patch. Some things are a bit hard to understand independently. I promise to review in a more timely manner from now on, although I might always add that many comments.

Please add to the tests cases what they are supposed to be testing and maybe give them more meaningful filenames.

The following is successfully detected as tensor contraction. Is this intended?

void foo(double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {
     for (int i = 0; i < 1024; i++)
         for (int j = 0; j < 1024; j++)
           for (int l = 0; l < 64; ++l)
              if (l != 0)
                for (int w = 0; w < 64; ++w)
                  C[i][j] += A[i][l][w] * B[w][j][l];
}

It might be if the codegen part is able exclude the element 0. In contrast, this one is rejected:

void foo(int n, double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {
     for (int i = 0; i < 1024; i++)
         for (int j = 0; j < 1024; j++)
           for (int l = 0; l < 64; l++)
             for (int w = 0; w < 64; ++w)
                if (w != 0)
                  C[i][j] += A[i][l][w] * B[w][j][l];
}

or this:

void foo(int n, double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {
     for (int i = 0; i < 1024; i++)
         for (int j = 0; j < 1024; j++)
           for (int l = 0; l < 64; l+=2)
             for (int w = 0; w < 64; ++w)
                  C[i][j] += A[i][l][w] * B[w][j][l];
}

I do get an assertion failure with this one:

void foo(double C[64][64], double A[64][64][64], double B[64][64][64]) {
     for (int i = 0; i < 32; i++)
         for (int j = 0; j < 32; j++)
           for (int l = 0; l < 32; l++)
             for (int w = 0; w < 32; ++w)
                  C[i][j] += A[i][l][w] * B[w][j][i+3];
}

Here, i occurs as indices for A, B, and C and detected as TC. Is this supported?

void foo(double C[64][64], double A[64][64][64], double B[64][64][64]) {
    for (int i = 0; i < 32; i++)
    for (int j = 0; j < 32; j++)
    for (int l = 0; l < 32; l++)
    for (int w = 0; w < 32; w++)
        C[i][j] += A[i][l][w] * B[w][j][i];
}

Is it possible to make it work with -polly-reschedule=0 as well? AFAICS there is nothing that requires rescheduling and would allow us to test the detection independently. isTCPattern seems to already consider multiple bands; the hurdle at the moment seems to be that the bands are not marked as permutable.

polly/lib/Transform/MatmulOptimizer.cpp
1590–1591
  1. should not be necessary; any permutation of the surrounding loops can be valid. Eg,
for (w = 0; w < 64; ++w)
  for (l = 0; l < 64; ++l)
    for (i = 0; i < 1024; i++)
      for (j = 0; j < 1024; j++)
           C[i][j] += A[i][l][w] * B[w][j][l];

yields the same result.

1602

Prefer Node.isa<isl::schedule_node_leaf>() (and then typed subclass: Node.as<isl_schedule_node_leaf>())

Meinersbur added inline comments.Mar 28 2022, 6:37 PM
polly/lib/Transform/MatmulOptimizer.cpp
183–186

Please use doxygen comments to describe class/struct members. @{ @} can be used to group them.

Also add an empty line before a new comment begins.

1116

rangeIslSize should make it easier.

1136

I like the idea of verifying the correctness by reconstructing and comparing to the original.

Maybe do it at the end to verify that the entire TCInfoTy is correct? On the other size, earlier fail would be better. What do you think?

1161–1164

This is a weird way to find out which indices map to what other index, I guess the equivalent of isMatMulOperandAcc. It requires that the dimension number if part of the AccMap's range, and if there is any expression will fail (eg Stmt[i][j] -> A[i-1][j+1] matches the wrong dimensions), but at least there is the verification afterwards.

I am not sure I like this sort of cleverness; I'd rather expect some sort of introspection into the map's coefficients, but I also think this should work in nearly all relevant cases and should be save due to the verification, so lets keep it.

However, please document this better, eg. add an example on what is expected to happen.

1171–1172

The "plain" function are unfortunately not very robust, eg its result is different depending on the internal representation. I'd suggestion getConstant (from ISLTools) but only takes an pw_aff.

Could you extract uses of plain_get_val_if_fixed into such a function, and mark it as TODO to cope with it later?

1219

I can use JScopImport to set a scalar memory access to a partial write without adding additional dependencies; that is, I don't think this can be just ignored.

I suggest to have a single function that calls getAccessesInOrder and sort out which MemAccess is read/write in there, then analyze them.

1250–1252

Introduce a is_superset (etc) call?

1270–1272

Could we add utility functions such that this becomes unite(J, P) == IndexSet?

1284

Test in containsOnlyTCAcc is exactly what I was looking for. A region statement could look like this:

c = C[i][j];
if (/*non-affine condition*/) {
  (void)A[i][k] + B[k][j];
} else {
  C[i][j] = c;
}

which has the correct order of accesses but is obviously not what we are looking for.

1374
1395

This seems to check setermine whether there is a reduction (contraction) carried by loop number Pos. The function name could be more meaningful. Suggestion: isDepCarryingReductionOverDim (not nice, but "TcDep" can mean anything)

1396

Consider passing by const-reference.

1399

plain_get_val_if_fixed is not really robust as it depends on the internal representation that can be different after eg simplify.

Since this just checks to a specific value, the best would be to create a new set where all the fixed dimensions are that value (here: 1), and check whether DepDelta is a subset of it.

1410–1412

Why not BoundDeltas.subtract() instead of deltas?

1451–1454

Should we also check whether WAW, RAW dependences are incompatible?

1460–1461

lexmin/lexmax can be expensive. Wrap into a IslMaxOperationsGuard?

1462

You seem to assume an functional relationship from here on. If that's the case, you can keep the type a pw_multi_aff which supports more functions that you may have missed such as pw_multi_aff::add

1464

Consider using reverse(rangeIslSize(0,DeltasDimNum)) (from ISLTools.h).

1465–1470

This is going to check whether each element out of Intersection is a contraction over dimension i. Don't we also need to check that every iteration out of the band i is contributing to that contraction?

1574

This is not part of the pattern detection, but the optimization. Could we move it to the patch that does the actual optimization?

1586–1587

Could you describe here what those 4 accesses are?

1592

Could you add a high-level description how the algorithm actually works? I.e. dependencies used to determine contraction dimensions, etc.

1607–1610

This condition is effectively identical to the next

1613

This constraint should not be intrinsic to the algorithm, but I agree it to be easier to handle for now.

1616–1617
1622

This looks for the outermost node that is not a filter or band. Is it possible that while that outermost node is not a TC contraction, one of the inner ones might? What if the outermost node is a filter, looks like it would just return false in this case.

polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm.ll
13
polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm_2.ll
1–3

Since this is not FileCheck-ing the LLVM-IR output, suppress it with -disable-output

Herald added a project: Restricted Project. · View Herald TranscriptMar 28 2022, 6:37 PM

Thank you very much for the review! I am sorry for the late response. I will try to to address all your comments within the next few weeks.

Thank you very much for the review! I am sorry for the late response. I will try to to address all your comments within the next few weeks.

No worries, I wasn't responsive either :-(

gareevroman marked 16 inline comments as done.EditedMay 9 2022, 8:06 AM

1

The following is successfully detected as tensor contraction. Is this intended?

void foo(double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {

for (int i = 0; i < 1024; i++)
    for (int j = 0; j < 1024; j++)
      for (int l = 0; l < 64; ++l)
         if (l != 0)
           for (int w = 0; w < 64; ++w)
             C[i][j] += A[i][l][w] * B[w][j][l];

}

Yes, it was intended. The transformation helps to optimize a class of programs, which is broader then a tensor contraction. However, it heavily depends on the codegen part. I think that the improvement of the detection can be the goal of the future work.

It might be if the codegen part is able exclude the element 0. In contrast, this one is rejected:

In this case, the codegen excludes the element 0 for i2. I added a test case for this.

domain: "{ Stmt4[i0, i1, i2, i3] : 0 <= i0 <= 1023 and 0 <= i1 <= 1023 and 0 < i2 <= 63 and 0 <= i3 <= 63 }"

In contrast, this one is rejected:

void foo(int n, double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {

for (int i = 0; i < 1024; i++)
    for (int j = 0; j < 1024; j++)
      for (int l = 0; l < 64; l++)
        for (int w = 0; w < 64; ++w)
           if (w != 0)
             C[i][j] += A[i][l][w] * B[w][j][l];

}

or this:

void foo(int n, double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {

for (int i = 0; i < 1024; i++)
    for (int j = 0; j < 1024; j++)
      for (int l = 0; l < 64; l+=2)
        for (int w = 0; w < 64; ++w)
             C[i][j] += A[i][l][w] * B[w][j][l];

}

As far as I know, in these cases, the codegen modifies some memory accesses. Consequently, they are not correspond to the current pattern.

ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef0[i0, i2, 1 + i3] };
ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef1[1 + i3, i1, i2] };
ReadAccess :=	[Reduction Type: +] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef2[i0, i1] };
MustWriteAccess :=	[Reduction Type: +] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef2[i0, i1] };

2

I do get an assertion failure with this one:

void foo(double C[64][64], double A[64][64][64], double B[64][64][64]) {

for (int i = 0; i < 32; i++)
    for (int j = 0; j < 32; j++)
      for (int l = 0; l < 32; l++)
        for (int w = 0; w < 32; ++w)
             C[i][j] += A[i][l][w] * B[w][j][i+3];

}

I fixed the isTCOperandAcc function and checked that all other asserts are used properly.

3

I do get an assertion failure with this one:

void foo(double C[64][64], double A[64][64][64], double B[64][64][64]) {

for (int i = 0; i < 32; i++)
    for (int j = 0; j < 32; j++)
      for (int l = 0; l < 32; l++)
        for (int w = 0; w < 32; ++w)
             C[i][j] += A[i][l][w] * B[w][j][i+3];

}

I fixed the isTCOperandAcc function and checked that all other asserts are used properly.

4

Here, i occurs as indices for A, B, and C and detected as TC. Is this supported?

void foo(double C[64][64], double A[64][64][64], double B[64][64][64]) {

for (int i = 0; i < 32; i++)
for (int j = 0; j < 32; j++)
for (int l = 0; l < 32; l++)
for (int w = 0; w < 32; w++)
    C[i][j] += A[i][l][w] * B[w][j][i];

}

For some reason, I cannot reproduce that. I have added a corresponding test case. As far as I understand, that should be detected because of the line 1365.

1347 static bool containsOnlyTCAcc(isl::set Domain, isl::map PartialSchedule,

                            TCInfoTy &TCI) {
...

1365 if (intersect(IandJIndexSet, TCI.P).size() != 0)
1366 return false;

5

I think that that it is redundant to require that bands are marked as permutable, since we check the form of dependencies and memory accesses. I propose to remove such checks for pattern matching optimizations.

polly/lib/Transform/MatmulOptimizer.cpp
1116

I think rangeIslSize can’t be used in this case. However, I’ve tried to use rangeIslSize to improve the patch.

1136

Other parts of TCInfoTy are verified in isTCOperandAcc too. I think that it would be better to verify the related information in one place as much and as early as possible.

Probably, the earlier fail would simplify the debugging, since we exactly know the form of memory accesses and can rely on it. Additionally, the performance can be improved, since the earlier fail helps to avoid additional operations with sets.

1219

I am not sure whether modifications of implementations of tensor contractions, which contain read and write scalar memory accesses, are useful in practice.

Moreover, since bundles of induction variables I, J, P can contain an unlimited number of dimensions, we possibly cannot follow the algorithm from the containsOnlyMatrMultAcc function, which permutes dimensions and checks that additional memory accesses have stride 0 in terms of dimensions MMM.i, MMM.j, and MMM.k. Consequently, such memory accesses can be treated as scalar memory accesses. I have not come up with an effective alternative yet.

That is why I do not consider scalar memory accesses in getWriteAccess and setReadAccesses functions. Could we mark it as TODO and do it future?

1284

Thanks for the example! I have added a corresponding test case. If I am not mistaken, it requires DeLICM.

1395

Could we name it isReductionCarriedOverDim? I think, in this case, we should rename the parameter Pos to Dim to make it more readable.

1410–1412

As far as I understand, these operations are not equal.

deltas computes a set containing the differences between image elements and corresponding domain elements in the input. subtract computes a subtraction of sets.

For example, in the case of the following sets they compute the following:

BoundDeltas : {Stmt_for_body15[31, 31, 31, 31, 31, 31] }
isl::manage(isl_set_neg(DepDelta.copy())): {Stmt_for_body15[0, 0, 0, 0, 0, -1]}

BoundDeltas.subtract(isl::manage(isl_set_neg(DepDelta.copy()))) : {Stmt_for_body15[31, 31, 31, 31, 31, 31]}
deltas: {Stmt_for_body15[31, 31, 31, 31, 31, 32]}

BoundDeltas : {Stmt_for_body15[31, 31, 31, 31, 31, 31]}
isl::manage(isl_set_neg(DepDelta.copy())): {Stmt_for_body15[0, 0, 0, -1, 0, 31]}

BoundDeltas.subtract(isl::manage(isl_set_neg(DepDelta.copy()))) : {Stmt_for_body15[31, 31, 31, 31, 31, 31]}
deltas: {Stmt_for_body15[31, 31, 31, 32, 31, 0]}

These comment interferes with the comment about pw_multi_aff. Consequently, I replaced the usage of isl_map_deltas with operations on pw_multi_aff.

1451–1454

As far as I understand, that is not necessary, because subsequently we check that the statement has the form C(shuffle(I, J)) = E(A(shuffle(I, P)),B(shuffle(P, J))C(shuffle(I, J))), where E is an expression that contains reads from the tensors A, B, C, and an arbitrary number of reads from constants with respect to bundles I, J , and P.

I have added a comment that describes this.

"The form of anti and output dependencies is determined specified by the form the SCoP statement, which is checked by subsequent analysis."

1460–1461

What is the maximal amount of computational steps we should use by default? I set it to 500000 according to DependenceInfo.cpp.

1465–1470

Could you clarify what do you mean by the band i? Are these indexes ki, which describe the dependencies?

isTcDep checks that the dependency has the form

/// S(..., ki, max(k(i + 1)), ..., max(kn), ...) -> S(..., ki + 1, min(k(i + 1)), ..., min(kn), …)

1602

Could we factor out this condition into ScheduleTreeOptimizer::isPMOptimizableBandNode, since it is common for isTCPattern and isMatrMultPattern functions? A new version of the patch shows how it could look like.

1613

Could we add a TODO comment for this?

1622

If I am not mistaken, this only checks that all band nodes, which represent the statement, are not split by filter nodes. These accepts a straightforward implementation of TC with/without delicm. For example,

domain: "{ Stmt_for_body8[i0, i1, i2] : 0 <= i0 <= 1599 and

                              0 <= i1 <= 1799 and
                              0 <= i2 <= 2199;
Stmt_for_body3[i0, i1] :      0 <= i0 <= 1599 and
                              0 <= i1 <= 1799;
Stmt_for_body3_last[i0, i1] : 0 <= i0 <= 1599 and
                              0 <= i1 <= 1799 }"

child:

sequence:
- filter: "{ Stmt_for_body3[i0, i1] }"
  child:
    schedule: "[{ Stmt_for_body3[i0, i1] -> [(i0)] }, { Stmt_for_body3[i0, i1] -> [(i1)] }]"
    permutable: 1
    coincident: [ 1, 1 ]
- filter: "{ Stmt_for_body3_last[i0, i1] }"
  child:
    schedule: "[{ Stmt_for_body3_last[i0, i1] -> [(i0)] }, { Stmt_for_body3_last[i0, i1] -> [(i1)] }]"
    permutable: 1
    coincident: [ 1, 1 ]
- filter: "{ Stmt_for_body8[i0, i1, i2] }"
  child:
    schedule: "[{ Stmt_for_body8[i0, i1, i2] -> [(i0)] },
                { Stmt_for_body8[i0, i1, i2] -> [(i1)] },
                { Stmt_for_body8[i0, i1, i2] -> [(i2)] }]"
    permutable: 1
    coincident: [ 1, 1, 0 ]

domain: "{ Stmt2[i0, i1, i2] : 0 <= i0 <= 31 and 0 <= i1 <= 31 and 0 <= i2 <= 31 }"
child:

schedule: "[{ Stmt2[i0, i1, i2] -> [(i0)] }, { Stmt2[i0, i1, i2] -> [(i1)] }, { Stmt2[i0, i1, i2] -> [(i2)] }]"
permutable: 1
coincident: [ 1, 1, 0 ]

Sorry, I have not committed an updated version of the optimization of TC to my github repo. However, I believe that, if this is that case, we can safely replace all such nodes.

+ auto NodeType = isl_schedule_node_get_type(Node.get());
+ while ((NodeType != isl_schedule_node_domain) &&
+ (NodeType != isl_schedule_node_filter)) {
+ assert((NodeType != isl_schedule_node_sequence) &&
+ L"Prevent the undefined behavior");
+ Node = Node.parent();
+ NodeType = isl_schedule_node_get_type(Node.get());
+ }
+ Node = Node.child(0);
+ Node = isl::manage(isl_schedule_node_cut(Node.release()));
+ return Node.insert_partial_schedule(Dimensions);

I think taht the detection of a more sophisticated implementations of TC is a possible goal of a future research.

I have described this in the comment.

polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm_2.ll
1–3

Could we fix the existing test cases in a separate patch?

polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm_2.ll

; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
; REQUIRES: asserts

polly/test/ScheduleOptimizer/pattern-matching-based-opts_16.ll

; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s

polly/test/ScheduleOptimizer/pattern-matching-based-opts_17.ll

; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s

polly/test/ScheduleOptimizer/pattern-matching-based-opts_18.ll

; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s

polly/test/ScheduleOptimizer/pattern-matching-based-opts_19.ll

; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s

polly/test/ScheduleOptimizer/pattern-matching-based-opts_20.ll

; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s

ormris removed a subscriber: ormris.May 16 2022, 10:58 AM

1
Yes, it was intended. The transformation helps to optimize a class of programs, which is broader then a tensor contraction. However, it heavily depends on the codegen part. I think that the improvement of the detection can be the goal of the future work.

Please document what pattern is intended to be recognized. I don't think the doc for isTCPattern is sufficient, it only mentioned what is checked. Documenting the intended pattern would help identifying if a check has been forgotten. E.g. for the statement domain.

As far as I know, in these cases, the codegen modifies some memory accesses. Consequently, they are not correspond to the current pattern.

ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef0[i0, i2, 1 + i3] };
ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef1[1 + i3, i1, i2] };
ReadAccess :=	[Reduction Type: +] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef2[i0, i1] };
MustWriteAccess :=	[Reduction Type: +] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef2[i0, i1] };

What do you mean by "codegen modifies some memory accesses"? Polly's Codegen? What is the check to exclude this? Is the 1 + i3 memory access expression? Where does it come from?

4

Here, i occurs as indices for A, B, and C and detected as TC. Is this supported?

void foo(double C[64][64], double A[64][64][64], double B[64][64][64]) {

for (int i = 0; i < 32; i++)
for (int j = 0; j < 32; j++)
for (int l = 0; l < 32; l++)
for (int w = 0; w < 32; w++)
    C[i][j] += A[i][l][w] * B[w][j][i];

}

For some reason, I cannot reproduce that. I have added a corresponding test case. As far as I understand, that should be detected because of the line 1365.

Should or should not be detected?

While debugging, it is now rejected because of this:
``

if (!TCI.B) {
  // IndexSet should be a union of J and P sets.
  if (unite(TCI.P, TCI.J) != IndexSet)
    return false;

``

Could you choose more meaningful identifiers than I, J, P and J, or use them in the pattern described in isTCPattern? I think of something like:

for (...) {
  ...
  for (...) {
    if (c)
      auto acc = C[P]; // ReadFromC
      auto a = A[Pa, I];
      auto b = B[Pb,J];
      auto arg = f(a,b);
      acc = acc op arg;
      C[P] = acc; // WriteToC
    }
  }
}

where P, I, J are sets of indices of the surrounding loops, Pa and Pb are subsets of P, I are the indices only occurring in the subscript of reading from A, J are the indices only occurring in the subscript for reading from B. There must be no indices not occuring in either P, I or J. `op=` is a commutative operation ...., c is an affine condition usually just `true`. `f` is a side-effect free operation.

(I don't whether this is correct, I want to understand whether the checked conditions are sufficient).

5

I think that that it is redundant to require that bands are marked as permutable, since we check the form of dependencies and memory accesses. I propose to remove such checks for pattern matching optimizations.

Ok.

polly/lib/Transform/MatmulOptimizer.cpp
204

@{ is not needed when documenting just a single member.

1246

The assignments should just make a copy of the array . With Dimensions being passed by-value, the caller has to make the copy which it should not need to.

SmallVector has no overload for being assigned an ArrayRef, but you could use llvm::append_range to insert all the values.

1291

Compiler warning:

/home/meinersbur/src/llvm-project/polly/lib/Transform/MatmulOptimizer.cpp:1310:13: warning: moving a temporary object prevents copy elision [-Wpessimizing-move]
    TCI.I = std::move(set_difference(IndexSet, TCI.P));

The result of set_difference is already an r-value, no need to cast it to an r-value.

1296

Same compiler warning.

1395

Sounds ok.

gareevroman marked 6 inline comments as done.Jun 12 2022, 4:12 AM

1
Yes, it was intended. The transformation helps to optimize a class of programs, which is broader then a tensor contraction. However, it heavily depends on the codegen part. I think that the improvement of the detection can be the goal of the future work.

Please document what pattern is intended to be recognized. I don't think the doc for isTCPattern is sufficient, it only mentioned what is checked. Documenting the intended pattern would help identifying if a check has been forgotten. E.g. for the statement domain.

I've added a description of the TC-like kernel, which is the intended pattern, to the doc for isTCPattern function. I've added additional remarks according to your comments and restrictions of the current implementation.

As far as I know, in these cases, the codegen modifies some memory accesses. Consequently, they are not correspond to the current pattern.

ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef0[i0, i2, 1 + i3] };
ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef1[1 + i3, i1, i2] };
ReadAccess :=	[Reduction Type: +] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef2[i0, i1] };
MustWriteAccess :=	[Reduction Type: +] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef2[i0, i1] };

What do you mean by "codegen modifies some memory accesses"? Polly's Codegen?

Sorry, I meant ScopBuilder. In the following case

void foo(double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {
for (int i = 0; i < 1024; i++)
    for (int j = 0; j < 1024; j++)
      for (int l = 0; l < 64; ++l)
         if (l != 0)
           for (int w = 0; w < 64; ++w)
             C[i][j] += A[i][l][w] * B[w][j][l];
}

ScopBuilder generates the following memory accesses, which correspond to the pattern:

{ Stmt4[i0, i1, i2, i3] -> MemRef0[o0, o1] : o0 = i0 and o1 = i1 }
{ Stmt4[i0, i1, i2, i3] -> MemRef3[o0, o1, o2] : o0 = i3 and o1 = i1 and o2 = i2 }
{ Stmt4[i0, i1, i2, i3] -> MemRef2[o0, o1, o2] : o0 = i0 and o1 = i2 and o2 = i3 }
{ Stmt4[i0, i1, i2, i3] -> MemRef0[o0, o1] : o0 = i0 and o1 = i1 }

If we changes that code a bit

void foo(int n, double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {
for (int i = 0; i < 1024; i++)
    for (int j = 0; j < 1024; j++)
      for (int l = 0; l < 64; l++)
        for (int w = 0; w < 64; ++w)
           if (w != 0)
             C[i][j] += A[i][l][w] * B[w][j][l];
}

ScopBuilder generates the following memory accesses:

{ Stmt3[i0, i1, i2, i3] -> MemRef2[o0, o1] : o0 = i0 and o1 = i1 }
{ Stmt3[i0, i1, i2, i3] -> MemRef2[o0, o1] : o0 = i0 and o1 = i1 }
{ Stmt3[i0, i1, i2, i3] -> MemRef1[o0, o1, o2] : o0 = 1 + i3 and o1 = i1 and o2 = i2 }
{ Stmt3[i0, i1, i2, i3] -> MemRef0[o0, o1, o2] : o0 = i0 and o1 = i2 and o2 = 1 + i3 }

In the context of the previous discussion, I meant that memory accesses are modified in comparison to the previous considered case.

What is the check to exclude this?

They will be rejected by isTCOperandAcc. If we fix the output dimensions, values of output dimensions will not form a permutation of a subset of values of input dimensions. Please see comments inside this function.

Is the 1 + i3 memory access expression?

Yes, I believe so.

Where does it come from?

If we look at the domain for the i3 variable, we see that the value 0 from the domain of w-loop is excluded and the loop bounds are modified to start from 0. Memory accesses correspond to this.

domain: "{ Stmt3[i0, i1, i2, i3] : 0 <= i0 <= 1023 and 0 <= i1 <= 1023 and 0 <= i2 <= 63 and 0 <= i3 <= 62 }"

4

Here, i occurs as indices for A, B, and C and detected as TC. Is this supported?

void foo(double C[64][64], double A[64][64][64], double B[64][64][64]) {

for (int i = 0; i < 32; i++)
for (int j = 0; j < 32; j++)
for (int l = 0; l < 32; l++)
for (int w = 0; w < 32; w++)
    C[i][j] += A[i][l][w] * B[w][j][i];

}

For some reason, I cannot reproduce that. I have added a corresponding test case. As far as I understand, that should be detected because of the line 1365.

Should or should not be detected?

That should not be detected, because the intersection of free and contracted indices should always be empty. We check this at the line "if (intersect(IandJIndexSet, TCI.P).size() != 0)".

While debugging, it is now rejected because of this:
``

if (!TCI.B) {
  // IndexSet should be a union of J and P sets.
  if (unite(TCI.P, TCI.J) != IndexSet)
    return false;

``

You’re right. Thank you. "if (intersect(IandJIndexSet, TCI.P).size() != 0)" doesn’t help in this case. In that example, we have dependencies of the form:

{ Stmt3[i0, i1, i2, i3] -> Stmt3[o0, o1, o2, o3] : (o0 = i0 and o1 = i1 and o2 = i2 and o3 = 1 + i3 and i0 >= 0 and i0 <= 31 and i1 >= 0 and i1 <= 31 and i2 >= 0 and i2 <= 31 and i3 >= 0 and i3 <= 30) or (i3 = 31 and o0 = i0 and o1 = i1 and o2 = 1 + i2 and o3 = 0 and i0 >= 0 and i0 <= 31 and i1 >= 0 and i1 <= 31 and i2 >= 0 and i2 <= 30) }

the isl ast has the form:

{ domain: "{ Stmt3[i0, i1, i2, i3] : 0 <= i0 <= 31 and 0 <= i1 <= 31 and 0 <= i2 <= 31 and 0 <= i3 <= 31 }", child: { mark: "Loop with Metadata", child: { schedule: "[{ Stmt3[i0, i1, i2, i3] -> [(i0)] }]", child: { mark: "Loop with Metadata", child: { schedule: "[{ Stmt3[i0, i1, i2, i3] -> [(i1)] }]", child: { mark: "Loop with Metadata", child: { schedule: "[{ Stmt3[i0, i1, i2, i3] -> [(i2)] }]", child: { mark: "Loop with Metadata", child: { schedule: "[{ Stmt3[i0, i1, i2, i3] -> [(i3)] }]" } } } } } } } } }
if (1 && (&MemRef5[31][31][32] <= &MemRef0[0][0] || &MemRef0[31][32] <= &MemRef5[0][0][0]) && (&MemRef4[31][31][32] <= &MemRef0[0][0] || &MemRef0[31][32] <= &MemRef4[0][0][0]))

    // Loop with Metadata
    for (int c0 = 0; c0 <= 31; c0 += 1) {
      // Loop with Metadata
      for (int c1 = 0; c1 <= 31; c1 += 1) {
        // Loop with Metadata
        for (int c2 = 0; c2 <= 31; c2 += 1) {
          // Loop with Metadata
          for (int c3 = 0; c3 <= 31; c3 += 1)
            Stmt3(c0, c1, c2, c3);
        }
      }
    }

else
    {  /* original code */ }

Consequently, only "l" and "w" are treated as "contracted indices", which are stored in TCI.P. Sorry, I missed that.

If indexes of an operand of the tensor contraction don’t contain TCI.P, we don't accept the program. We check this in lines

…
if (!isSuperset(IndexSet, TCI.P))
      return false;
…

…
if (unite(TCI.P, TCI.J) != IndexSet)
      return false;
…

unite(TCI.P, TCI.J) still isn't equal to IndexSet in the considered case, because IndexSet doesn't contain the "l".

If we didn't have the "l-loop", unite(TCI.P, TCI.J) still would not be equal to IndexSet, which would contain i, j and w, because TCI.I would contain only i and TCI.J would contain only j.

test/ScheduleOptimizer/pattern-matching-based-opts_21.ll is the corresponding test case. Additionally, I've added test/ScheduleOptimizer/pattern-matching-based-opts_25.ll.

P.S.: I had to use the -fno-unroll-loops option of clang. Otherwise, one of the loops is optimized out on my machine.

Could you choose more meaningful identifiers than I, J, P and J, or use them in the pattern described in isTCPattern? I think of something like:

I’ve added a description of bundles I, J, and P to the description of the pattern. I hope it clarifies their purpose. I propose to apply the terminology, which is used in the paper [1] and it predecessors (e.g., [2]), to simplify the understanding of the code for their readers.

[1] - Gareev R., Grosser T., Kruse M. High-Performance Generalized Tensor Operations: A Compiler-Oriented Approach ACM Transactions Architecture and Code Optimization (TACO). 2018. Vol. 15, no. 3. P. 34:1–34:27. DOI: 10.1145/3235029.
[2] - Matthews D. High-Performance Tensor Contraction without BLAS
SIAM Journal on Scientific Computing. 2018. Vol. 40, no. 1. P. C 1—C 24. DOI: 110.1137/16m108968x.

for (...) {
  ...
  for (...) {
    if (c)
      auto acc = C[P]; // ReadFromC
      auto a = A[Pa, I];
      auto b = B[Pb,J];
      auto arg = f(a,b);
      acc = acc op arg;
      C[P] = acc; // WriteToC
    }
  }
}

where P, I, J are sets of indices of the surrounding loops, Pa and Pb are subsets of P, I are the indices only occurring in the subscript of reading from A, J are the indices only occurring in the subscript for reading from B. There must be no indices not occuring in either P, I or J.

I think the definition of the TC-like corresponds to the information about sets P, I, J. Probably, it’s redundant to introduce the terminology for subsets at this point. Could we do this in the description of the optimization, if it’d be needed?

op= is a commutative operation ....,

I think this a redundant condition. However, you can find it in the paper [1]. I believe that, if preserve the order of loops with indexes from the bundle P during the optimization, there would not be any violation.

c is an affine condition usually just true.

Could you elaborate on that? I’m not sure that I understand where such a condition is used.

f is a side-effect free operation.

If I’m not mistaken, according to, for example, ScopDetection.cpp, only side effect free functions calls can be located inside a Scop.

(I don't whether this is correct, I want to understand whether the checked conditions are sufficient).

> 5
> 
> I think that that it is redundant to require that bands are marked as permutable, since we check the form of dependencies and memory accesses. I propose to remove such checks for pattern matching optimizations.

Ok.
polly/lib/Transform/MatmulOptimizer.cpp
1246

I think we can use llvm::replace to avoid clearing the vector and preserve the logic.

Meinersbur added a comment.EditedJun 29 2022, 9:01 PM
void foo(int n, double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {
for (int i = 0; i < 1024; i++)
    for (int j = 0; j < 1024; j++)
      for (int l = 0; l < 64; l++)
        for (int w = 0; w < 64; ++w)
           if (w != 0)
             C[i][j] += A[i][l][w] * B[w][j][l];
}

ScopBuilder generates the following memory accesses:

{ Stmt3[i0, i1, i2, i3] -> MemRef2[o0, o1] : o0 = i0 and o1 = i1 }
{ Stmt3[i0, i1, i2, i3] -> MemRef2[o0, o1] : o0 = i0 and o1 = i1 }
{ Stmt3[i0, i1, i2, i3] -> MemRef1[o0, o1, o2] : o0 = 1 + i3 and o1 = i1 and o2 = i2 }
{ Stmt3[i0, i1, i2, i3] -> MemRef0[o0, o1, o2] : o0 = i0 and o1 = i2 and o2 = 1 + i3 }

In the context of the previous discussion, I meant that memory accesses are modified in comparison to the previous considered case.

Where does it come from?

If we look at the domain for the i3 variable, we see that the value 0 from the domain of w-loop is excluded and the loop bounds are modified to start from 0. Memory accesses correspond to this.

Looks like some other optimization (maybe JumpThreading?) modifies the loop range. Ideally, the detection would be robust enough to not depend on the whether the domain space has an offset.

Could you choose more meaningful identifiers than I, J, P and J, or use them in the pattern described in isTCPattern? I think of something like:

I’ve added a description of bundles I, J, and P to the description of the pattern. I hope it clarifies their purpose. I propose to apply the terminology, which is used in the paper [1] and it predecessors (e.g., [2]), to simplify the understanding of the code for their readers.

In a paper the names must be shorter to fit on the page and a described close to the figure. It also should not be necessary to have access to the paper to understand the algorithm. For narrowly scoped variables such as loop counters single letters might be ok because the definition is likely on the same screen (or for a paper: the same page), but globals should be more identifiable. paper [1] also does not make the connection to the symbols it uses and what they correspond to in Polly's data structures.

However, I don't request such a change atm.

for (...) {
  ...
  for (...) {
    if (c)
      auto acc = C[P]; // ReadFromC
      auto a = A[Pa, I];
      auto b = B[Pb,J];
      auto arg = f(a,b);
      acc = acc op arg;
      C[P] = acc; // WriteToC
    }
  }
}

where P, I, J are sets of indices of the surrounding loops, Pa and Pb are subsets of P, I are the indices only occurring in the subscript of reading from A, J are the indices only occurring in the subscript for reading from B. There must be no indices not occuring in either P, I or J.

I think the definition of the TC-like corresponds to the information about sets P, I, J. Probably, it’s redundant to introduce the terminology for subsets at this point. Could we do this in the description of the optimization, if it’d be needed?

It is also 'redundant' with the code itself, but it helps understanding it.

The description of I and P are "Input dimensions of the schedule space, which represent free indices of tensors." One has to know what "free indices" are, indices of what, etc.

The "definition" of isTCPattern is declarative, does not even explain what all those symbols are, and refers to our matmul paper which does not apply because it only is a special case of the TC pattern.

op= is a commutative operation ....,

I think this a redundant condition. However, you can find it in the paper [1]. I believe that, if preserve the order of loops with indexes from the bundle P during the optimization, there would not be any violation.

This is exactly the sort of thin I would want to clarify.

c is an affine condition usually just true.

Could you elaborate on that? I’m not sure that I understand where such a condition is used.

It corresponds to a filter node in the TC body. You have used the if (w != 0) to illustrate where the access function deviates from the usually pattern which implied that it would be something you would like to support. If not, that would not be part of the pattern.

f is a side-effect free operation.

If I’m not mistaken, according to, for example, ScopDetection.cpp, only side effect free functions calls can be located inside a Scop.

f is not necessarily a function call, but as mentioned a "operation" representing the calculation done in the the TC body.

Side-effect here means something different than ScopDetection. A write to an unrelated array D would be a side-effect for the TC, but accurately represented by a Scop.
We could allow unknown side-effects in polly in the future with a general "memory" dependency, an extension to what Polly already does with -polly-allow-modref-calls. These could just not be reordered relative to each other.

I really think the documentation should be better. I had a hard time fixing bugs in the matmul optimization just with understanding what the code is supposed to be doing after a long time and would prefer to no repeat that again. See rGcad9f98a2ad98fecf663e9ce39502b8e43676fc9 and rGa56bd7dec8da4348d847d53c96d8a30f4a821d36.

polly/lib/Support/ISLTools.cpp
266 ↗(On Diff #436206)

nice

polly/lib/Transform/MatmulOptimizer.cpp
1219

The concern is that I can modify what isLatestArrayKind() returns by simply importing a JScop. The continue just ignores such weirdness but I think it is safer to fail in this case.

You yourself mention that scalar accesses are likely not useful, so why not fail when one is found instead (return null instead of continue)? Some exceptions may be possible, such as read-only scalars (VirtualUse::ReadOnly, VirtualUse::Synthesizable)

1465–1470

There is a check Intersection.is_empty() which is going to detect if a dependency is completely missing. But what detects that only some of the dependencies are present. Such as:

[p] -> { Stmt3[i0, i1, i2, i3] -> Stmt3[o0, o1, o2, o3] : .... and p != 0 }

or

{ Stmt3[i0, i1, i2, i3] -> Stmt3[o0, o1, o2, o3] : .... and i0 < 42 }

(assuming contracting over i=0)

isReductionCarriedOverDim doesn't seem to check whether the dependency is over the complete domain either.

1613

Yes, that would be great.

1619
1622

I think some info in the comment like "all surrounding band nodes are assumed to be part of the TC and must not be interleaved by filter nodes."

Since it is not checking for it, it seems to imply that all other nodes types are OK? (sequence, set, expansion, extension, marker). Maybe reject them too? (I think ignoring marker nodes might still be ok)

polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm_2.ll
1–3

👍

void foo(int n, double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {
for (int i = 0; i < 1024; i++)
    for (int j = 0; j < 1024; j++)
      for (int l = 0; l < 64; l++)
        for (int w = 0; w < 64; ++w)
           if (w != 0)
             C[i][j] += A[i][l][w] * B[w][j][l];
}

ScopBuilder generates the following memory accesses:

{ Stmt3[i0, i1, i2, i3] -> MemRef2[o0, o1] : o0 = i0 and o1 = i1 }
{ Stmt3[i0, i1, i2, i3] -> MemRef2[o0, o1] : o0 = i0 and o1 = i1 }
{ Stmt3[i0, i1, i2, i3] -> MemRef1[o0, o1, o2] : o0 = 1 + i3 and o1 = i1 and o2 = i2 }
{ Stmt3[i0, i1, i2, i3] -> MemRef0[o0, o1, o2] : o0 = i0 and o1 = i2 and o2 = 1 + i3 }

In the context of the previous discussion, I meant that memory accesses are modified in comparison to the previous considered case.

Where does it come from?

If we look at the domain for the i3 variable, we see that the value 0 from the domain of w-loop is excluded and the loop bounds are modified to start from 0. Memory accesses correspond to this.

Looks like some other optimization (maybe JumpThreading?) modifies the loop range. Ideally, the detection would be robust enough to not depend on the whether the domain space has an offset.

I agree. Improving the detection is a possible goal of a future work.

Could you choose more meaningful identifiers than I, J, P and J, or use them in the pattern described in isTCPattern? I think of something like:

I’ve added a description of bundles I, J, and P to the description of the pattern. I hope it clarifies their purpose. I propose to apply the terminology, which is used in the paper [1] and it predecessors (e.g., [2]), to simplify the understanding of the code for their readers.

In a paper the names must be shorter to fit on the page and a described close to the figure. It also should not be necessary to have access to the paper to understand the algorithm. For narrowly scoped variables such as loop counters single letters might be ok because the definition is likely on the same screen (or for a paper: the same page), but globals should be more identifiable. paper [1] also does not make the connection to the symbols it uses and what they correspond to in Polly's data structures.

However, I don't request such a change atm.

Ok. I've tried to make the description of the algorithm self-consistent.

for (...) {
  ...
  for (...) {
    if (c)
      auto acc = C[P]; // ReadFromC
      auto a = A[Pa, I];
      auto b = B[Pb,J];
      auto arg = f(a,b);
      acc = acc op arg;
      C[P] = acc; // WriteToC
    }
  }
}

where P, I, J are sets of indices of the surrounding loops, Pa and Pb are subsets of P, I are the indices only occurring in the subscript of reading from A, J are the indices only occurring in the subscript for reading from B. There must be no indices not occuring in either P, I or J.

I think the definition of the TC-like corresponds to the information about sets P, I, J. Probably, it’s redundant to introduce the terminology for subsets at this point. Could we do this in the description of the optimization, if it’d be needed?

It is also 'redundant' with the code itself, but it helps understanding it.

The description of I and P are "Input dimensions of the schedule space, which represent free indices of tensors." One has to know what "free indices" are, indices of what, etc.

The "definition" of isTCPattern is declarative, does not even explain what all those symbols are, and refers to our matmul paper which does not apply because it only is a special case of the TC pattern.

I've tried to improve the description.

op= is a commutative operation ....,

I think this a redundant condition. However, you can find it in the paper [1]. I believe that, if preserve the order of loops with indexes from the bundle P during the optimization, there would not be any violation.

This is exactly the sort of thin I would want to clarify.

We don't check for associativity, because it's difficult and not necessary for the optimization. The optimization doesn't change the order of loops with indexes from the bundle P during the optimization, even if you parallize the outermost loop. Hence, it doesn't violate anything. For the same reason, we don't check for associativity in the case of the optimization of the generalization of matrix-matrix multiplication, which is currently used in Polly.

c is an affine condition usually just true.

Could you elaborate on that? I’m not sure that I understand where such a condition is used.

It corresponds to a filter node in the TC body. You have used the if (w != 0) to illustrate where the access function deviates from the usually pattern which implied that it would be something you would like to support. If not, that would not be part of the pattern.

Ok. Unfortunately, the current approach does't support this. So, it's not the part of the pattern.

f is a side-effect free operation.

If I’m not mistaken, according to, for example, ScopDetection.cpp, only side effect free functions calls can be located inside a Scop.

f is not necessarily a function call, but as mentioned a "operation" representing the calculation done in the the TC body.

Side-effect here means something different than ScopDetection. A write to an unrelated array D would be a side-effect for the TC, but accurately represented by a Scop.
We could allow unknown side-effects in polly in the future with a general "memory" dependency, an extension to what Polly already does with -polly-allow-modref-calls. These could just not be reordered relative to each other.

I agree that only side-effect free operations are considered in the pattern. Nevertheless, I propose not to use terms that may require an additional specification. I've tried to improve the description of the pattern.

I really think the documentation should be better. I had a hard time fixing bugs in the matmul optimization just with understanding what the code is supposed to be doing after a long time and would prefer to no repeat that again. See rGcad9f98a2ad98fecf663e9ce39502b8e43676fc9 and rGa56bd7dec8da4348d847d53c96d8a30f4a821d36.

Sure. Let's continue improving it.

gareevroman marked 4 inline comments as done.Jul 29 2022, 11:56 PM
gareevroman added inline comments.
polly/lib/Transform/MatmulOptimizer.cpp
1219

I've added such a return statement to avoid scalar write memory accesses. Sorry, I was wrong. We need scalar read memory accesses. For example, in the case of the following matrix-matrix multiplication, a SCoP statement, which represents the body of the loop, contains the constant alpha.

C = alpha*A*B

Could we accept non-partial scalar read memory accesses? I think this is legal.

1465–1470

Are dependencies determined by the form of memory accesses? In the isCorrectAccessMap function, we check that memory accesses aren't partial. Isn't it sufficient?

I've tried to check whether the dependency is over the complete domain though.

1613

Ok. I've left that TODO comment.

1619

Looks like I missed that. Sorry. I will fix it in the next patch.

1622

I think some info in the comment like "all surrounding band nodes are assumed to be part of the TC and must not be interleaved by filter nodes."

I've added it it.

Since it is not checking for it, it seems to imply that all other nodes types are OK? (sequence, set, expansion, extension, marker). Maybe reject them too? (I think ignoring marker nodes might still be ok)

Sequence nodes could be necessary, if DeLICM was applied. Please, see the example inside the isTCPattern. Yes, I think other types except for marker nodes should be rejected. Additionally, as a precaution, I propose to check that a filter node has only a sequence and a domain nodes as its predecessors. I've updated the patch.

polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm_2.ll
1–3

Ok. I've added this to my TODO list.

Meinersbur accepted this revision.Aug 2 2022, 11:55 AM

Thank you Gareev. I think the description can still be improved, I but we should also move forward and can improve iteratively.

Looking forward for the actual TC optimization.

polly/lib/Transform/MatmulOptimizer.cpp
193

AFAIU multiplication by β is not part of this detection, but required to be loop-distributed by the isl scheduler.

1141
1284

It does not require DeLICM, but -polly-allow-nonaffine-branches (which is enabled by default)

1614

[typo]

1695

What is Goto here? GotoBLAS?

This revision is now accepted and ready to land.Aug 2 2022, 11:55 AM
This revision was automatically updated to reflect the committed changes.
gareevroman marked 2 inline comments as done.
gareevroman marked 6 inline comments as done.Aug 7 2022, 4:27 AM

Thank you Gareev. I think the description can still be improved, I but we should also move forward and can improve iteratively.

Looking forward for the actual TC optimization.

Thanks! I've tried to address new comments in the committed patch.

polly/lib/Transform/MatmulOptimizer.cpp
193

Yes, it's not. I've added a comment about this.

1284

If I'm not mistaken, in your example the form of the dependencies doesn't correspond to the pattern.

c = C[i][j];
if (/*non-affine condition*/) {
  A[i][k] + B[k][j];
} else {
  C[i][j] = c;
}

MayWrite: { Stmt_for_body8TOfor_inc[i0, i1, i2] -> MemRef_C[i0, i1] : 0 <= i0 <= 31 and 0 <= i1 <= 31 and 0 <= i2 <= 31 }

I've added a slightly modified version of it to polly/test/ScheduleOptimizer/pattern-matching-based-opts_23.ll. It produces a region statement too.

for (int i = 0; i < 32; i++)
  for (int j = 0; j < 32; j++)
    for (int k = 0; k < 32; k++) {
      int c = C[i][j];
      if (i*j*k < 10) {
        C[i][j] = A[i][k] + B[k][j];
      } else {
        C[i][j] = c;
      } 
}

However, it introduces store merge phi nodes. It makes DeLICM necessary.

Statements {
	Stmt_for_body8__TO__if_end
        Domain :=
            { Stmt_for_body8__TO__if_end[i0, i1, i2] : 0 <= i0 <= 31 and 0 <= i1 <= 31 and 0 <= i2 <= 31 };
        Schedule :=
            { Stmt_for_body8__TO__if_end[i0, i1, i2] -> [i0, i1, i2, 0] };
        ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
            { Stmt_for_body8__TO__if_end[i0, i1, i2] -> MemRef_A[i0, i2] };
        ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
            { Stmt_for_body8__TO__if_end[i0, i1, i2] -> MemRef_B[i2, i1] };
        ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
            { Stmt_for_body8__TO__if_end[i0, i1, i2] -> MemRef_C[i0, i1] };
        MustWriteAccess :=	[Reduction Type: NONE] [Scalar: 1]
            { Stmt_for_body8__TO__if_end[i0, i1, i2] -> MemRef_storemerge__phi[] };
       new: { Stmt_for_body8__TO__if_end[i0, i1, i2] -> MemRef_C[i0, i1] };
	Stmt_if_end
        Domain :=
            { Stmt_if_end[i0, i1, i2] : 0 <= i0 <= 31 and 0 <= i1 <= 31 and 0 <= i2 <= 31 };
        Schedule :=
            { Stmt_if_end[i0, i1, i2] -> [i0, i1, i2, 1] };
        ReadAccess :=	[Reduction Type: NONE] [Scalar: 1]
            { Stmt_if_end[i0, i1, i2] -> MemRef_storemerge__phi[] };
       new: { Stmt_if_end[i0, i1, i2] -> MemRef_C[i0, i1] };
        MustWriteAccess :=	[Reduction Type: NONE] [Scalar: 0]
            { Stmt_if_end[i0, i1, i2] -> MemRef_C[i0, i1] };
Meinersbur added inline comments.Aug 8 2022, 3:53 PM
polly/lib/Transform/MatmulOptimizer.cpp
1284

The storemerge PHI node is introduced by one of the canonicalization passes (in this case: InstCombine). It it possible to not run that pass, disable the matching of this particular pattern, or use some other trick to not be matched. In any case, we cannot rely on the InstCombine to happen. It might be safer to bail out if any RegionStmt is encountered.

gareevroman marked an inline comment as done.Aug 14 2022, 3:39 AM
gareevroman added inline comments.
polly/lib/Transform/MatmulOptimizer.cpp
1284

I see. I haven't managed to fix that test case. So, I've decided to remove it. I've left the check that bails out if any RegionStmt is encountered in containsOnlyTCAcc.