This is an archive of the discontinued LLVM Phabricator instance.


Differential D114336

[Polly] Generalize the pattern matching to the case of tensor contractions.
ClosedPublic

Authored by gareevroman on Nov 21 2021, 6:10 AM.


Details

Reviewers

Meinersbur
bollu

Commits

rGb02c7e2b630a: [Polly] Generalize the pattern matching to the case of tensor contractions

Summary

The pattern matching optimization of Polly detects and optimizes dense general matrix-matrix multiplication. The generated code is close to high performance implementations of matrix-matrix multiplications, which are contained in manually tuned libraries [1]. The described pattern matching optimization is a particular case of tensor contraction optimization, which was introduced in [2].

This patch generalizes the pattern matching to the case of tensor contractions using the algorithm described in [2]. Following the ideas introduced in [3], it logically represents tensor contraction operands as matrix multiplication operands and uses the approach presented in [1].

Optimization of tensor contractions will be added in the next patch. These modifications can be found in https://github.com/gareevroman/llvm-project.
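To illustrate the idea of logically treating tensor contraction operands as matrix multiplication operands, here is a minimal sketch. It is not code from the patch: the function names, the flat row-major storage, and the use of `long` elements (chosen so the two results compare exactly) are all assumptions made for illustration. It uses the contraction `C[i][j] += A[i][l][w] * B[w][j][l]` that appears later in this review and fuses the contracted indices `l` and `w` into one logical index `p`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Tensor = std::vector<long>;

// Reference contraction C[i][j] += A[i][l][w] * B[w][j][l].
// A has shape [NI][NL][NW], B has shape [NW][NJ][NL], both row-major.
Tensor contractDirect(const Tensor &A, const Tensor &B, std::size_t NI,
                      std::size_t NJ, std::size_t NL, std::size_t NW) {
  assert(A.size() == NI * NL * NW && B.size() == NW * NJ * NL);
  Tensor C(NI * NJ, 0);
  for (std::size_t i = 0; i < NI; ++i)
    for (std::size_t j = 0; j < NJ; ++j)
      for (std::size_t l = 0; l < NL; ++l)
        for (std::size_t w = 0; w < NW; ++w)
          C[i * NJ + j] += A[(i * NL + l) * NW + w] * B[(w * NJ + j) * NL + l];
  return C;
}

// The same computation written as a matrix product C[i][j] += A'[i][p] * B'[p][j],
// where p = l * NW + w fuses the contracted indices. A' and B' are only
// logical views: elements are looked up in the original layout, nothing is copied.
Tensor contractAsMatMul(const Tensor &A, const Tensor &B, std::size_t NI,
                        std::size_t NJ, std::size_t NL, std::size_t NW) {
  assert(A.size() == NI * NL * NW && B.size() == NW * NJ * NL);
  const std::size_t NP = NL * NW;
  Tensor C(NI * NJ, 0);
  for (std::size_t i = 0; i < NI; ++i)
    for (std::size_t j = 0; j < NJ; ++j)
      for (std::size_t p = 0; p < NP; ++p) {
        const std::size_t l = p / NW;
        const std::size_t w = p % NW;
        const long AElem = A[(i * NL + l) * NW + w]; // A'[i][p]
        const long BElem = B[(w * NJ + j) * NL + l]; // B'[p][j]
        C[i * NJ + j] += AElem * BElem;
      }
  return C;
}
```

Both functions perform the same multiply-adds; only the loop structure differs, which is what lets a matrix-multiplication scheme be reused for the contraction.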

[1] - Low T.M., Igual F.D., Smith T.M., Quintana-Orti E.S. Analytical Modeling Is Enough for High-Performance BLIS // ACM Transactions on Mathematical Software. 2016. Vol. 43, no. 2. P. 12:1–12:18. DOI: 10.1145/2925987.

[2] - Gareev R., Grosser T., Kruse M. High-Performance Generalized Tensor Operations: A Compiler-Oriented Approach // ACM Transactions on Architecture and Code Optimization (TACO). 2018. Vol. 15, no. 3. P. 34:1–34:27. DOI: 10.1145/3235029.

[3] - Matthews D. High-Performance Tensor Contraction without BLAS // SIAM Journal on Scientific Computing. 2018. Vol. 40, no. 1. P. C1–C24. DOI: 10.1137/16m108968x.


Event Timeline

gareevroman created this revision. · Nov 21 2021, 6:10 AM

Herald added a reviewer: bollu. · Nov 21 2021, 6:10 AM

Herald added a subscriber: asbirlea.

gareevroman requested review of this revision. · Nov 21 2021, 6:10 AM

Herald added a project: Restricted Project. · Nov 21 2021, 6:10 AM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B135309: Diff 388756. · Nov 21 2021, 6:23 AM

Thanks for upstreaming your tensor optimization!

The naming of polly-pattern-matching-based-opts suggests that it includes all pattern-based optimizations, yet this patch introduces another flag, polly-pattern-matching-based-tc-opts. I'd prefer polly-pattern-matching-based-opts to control both optimizations, with additional flags for enabling the matrix-multiplication and tensor optimizations individually. Alternatively, rename polly-pattern-matching-based-opts to e.g. polly-matmul-opt. Also, you are adding this functionality into a file called MatmulOptimizer and a function called tryOptimizeMatMulPattern. Since matrix multiplication is a tensor contraction, is tc-opt supposed to supersede the matrix-multiplication optimization? In short, I would like to know what the relation between those two optimizations should be. I'd prefer not to have to maintain two optimizations if one is strictly more powerful than the other.

I'd enjoy some occasional comments and grouping of statements (empty lines) inside the functions in addition to the doxygen comments. For instance isTCOperandAcc is just a wall of code. For such property-checking functions, ideally each return should mention what property is violated here and why this property is required to be compatible with tensor optimization.

polly/lib/Transform/MatmulOptimizer.cpp
180–181	Please add more details on what the members represent.
187–189	`std::set` is a high-overhead implementation. Consider using `DenseSet` or `SmallDenseSet`. See https://www.llvm.org/docs/ProgrammersManual.html#llvm-adt-denseset-h
190–196	Is there an argument for using 30 as the small size? If not, consider using just `SmallVector<int>`.
1091	[style] No reason to make this a wide string literal, especially if just used as an assertion failed message. Applies to other occurrences as well.
1110
1116	or introduce `intFromIslSize`.
1149–1150	`SmallVectorImpl` is not specific to what the vector's small size is.
1160	Consider using `polly::rangeIslSize` for iterating over dimensions.
1187	Although already in an anon namespace, the other methods add `static` as well. I found it helps the compiler to warn if a static function is unused.
1189–1190	Do you know of `#include <llvm/ADT/SetOperations.h>`? Unfortunately, these modify one set rather than returning a new set.
1219
1229–1230	This computes whether two sets are disjoint; it should not be required to compute the intersection.
1246
1284	`getAccessesInOrder` requires `Stmt` to not be a RegionStmt. Please add a test for it.
1294	If any of the returns are executed, what causes the pattern to be rejected (it's not `return false`)?
1363–1364
1365–1367	The check should not depend on `n_basic_set`, which is fragile and depends on whether e.g. `coalesce` was successful. Consider using something like `polly::getConstant`.
1411–1412	Consider `lexmin_pw_multi_aff`/`lexmax_pw_multi_aff`
polly/lib/Transform/ScheduleOptimizer.cpp
459–463	Instead of modifying the idea of whether a node is tilable, consider introducing another constraint-checking function, as we should have done with prevectorization as well.
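The comment on lines 1229–1230 observes that a disjointness test does not need the intersection itself. A minimal sketch of that idea over `std::set` follows; the helper name `areDisjoint` is hypothetical and not part of the patch or of LLVM's ADT.

```cpp
#include <cassert>
#include <set>

// Returns true if the two ordered sets have no element in common.
// Both sets are walked in lockstep and the walk stops at the first shared
// element, so no intersection set is ever built.
bool areDisjoint(const std::set<int> &LHS, const std::set<int> &RHS) {
  std::set<int>::const_iterator L = LHS.begin();
  std::set<int>::const_iterator R = RHS.begin();
  while (L != LHS.end() && R != RHS.end()) {
    if (*L == *R)
      return false;
    if (*L < *R)
      ++L;
    else
      ++R;
  }
  return true;
}
```

A call such as `areDisjoint(IandJ, P)` would then take the place of building an intersection only to test it for emptiness.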

Thank you very much for the review!

I've tried to address all comments. Additionally, I've updated the optimization for the case of filter nodes (e.g., polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm.ll). I believe that the optimization of tensor contractions is strictly more powerful than the optimization of matrix multiplications. So, I'd suggest eventually replacing the optimization of matrix multiplications with it.

gareevroman marked 19 inline comments as done. · Dec 12 2021, 1:30 AM

gareevroman added inline comments.

polly/lib/Transform/MatmulOptimizer.cpp
1116	As far as I understand, Dimensions.size() returns a value of type size_t instead of a value of the type isl_size. So, in the new version I used the unsigned type to avoid the cast.
1229–1230	That check is redundant. Thanks.
1246	As far as I understand, we cannot do this here because of the assignment to TCI.ADimensions and TCI.BDimensions
1284	I’ve added a check to containsOnlyTCAcc. Could you clarify what the test case should look like? Should it be a region statement that contains a matrix multiplication with the right order of memory accesses?
1365–1367	I think that this check was redundant. I’ve removed it.

gareevroman marked 5 inline comments as done. · Dec 12 2021, 1:31 AM

Harbormaster completed remote builds in B138850: Diff 393734. · Dec 12 2021, 1:41 AM

ping

I am extremely sorry for the late review, I hope you could use the time to work on/polish the follow-up patch. Some things are a bit hard to understand independently. I promise to review in a more timely manner from now on, although I might always add that many comments.

Please add to the test cases a description of what they are supposed to be testing, and maybe give them more meaningful filenames.

The following is successfully detected as tensor contraction. Is this intended?

void foo(double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {
     for (int i = 0; i < 1024; i++)
         for (int j = 0; j < 1024; j++)
           for (int l = 0; l < 64; ++l)
              if (l != 0)
                for (int w = 0; w < 64; ++w)
                  C[i][j] += A[i][l][w] * B[w][j][l];
}

It might be if the codegen part is able to exclude the element 0. In contrast, this one is rejected:

void foo(int n, double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {
     for (int i = 0; i < 1024; i++)
         for (int j = 0; j < 1024; j++)
           for (int l = 0; l < 64; l++)
             for (int w = 0; w < 64; ++w)
                if (w != 0)
                  C[i][j] += A[i][l][w] * B[w][j][l];
}

or this:

void foo(int n, double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {
     for (int i = 0; i < 1024; i++)
         for (int j = 0; j < 1024; j++)
           for (int l = 0; l < 64; l+=2)
             for (int w = 0; w < 64; ++w)
                  C[i][j] += A[i][l][w] * B[w][j][l];
}

I do get an assertion failure with this one:

void foo(double C[64][64], double A[64][64][64], double B[64][64][64]) {
     for (int i = 0; i < 32; i++)
         for (int j = 0; j < 32; j++)
           for (int l = 0; l < 32; l++)
             for (int w = 0; w < 32; ++w)
                  C[i][j] += A[i][l][w] * B[w][j][i+3];
}

Here, i occurs as an index for A, B, and C, and it is detected as TC. Is this supported?

void foo(double C[64][64], double A[64][64][64], double B[64][64][64]) {
    for (int i = 0; i < 32; i++)
    for (int j = 0; j < 32; j++)
    for (int l = 0; l < 32; l++)
    for (int w = 0; w < 32; w++)
        C[i][j] += A[i][l][w] * B[w][j][i];
}

Is it possible to make it work with -polly-reschedule=0 as well? AFAICS there is nothing that requires rescheduling and would allow us to test the detection independently. isTCPattern seems to already consider multiple bands; the hurdle at the moment seems to be that the bands are not marked as permutable.

polly/lib/Transform/MatmulOptimizer.cpp
1590–1591	should not be necessary; any permutation of the surrounding loops can be valid. Eg, for (w = 0; w < 64; ++w) for (l = 0; l < 64; ++l) for (i = 0; i < 1024; i++) for (j = 0; j < 1024; j++) C[i][j] += A[i][l][w] * B[w][j][l]; yields the same result.
1602	Prefer `Node.isa<isl::schedule_node_leaf>()` (and then the typed subclass: `Node.as<isl::schedule_node_leaf>()`)

Meinersbur added inline comments. · Mar 28 2022, 6:37 PM

polly/lib/Transform/MatmulOptimizer.cpp
183–186	Please use doxygen comments to describe class/struct members. `@{ @}` can be used to group them. Also add an empty line before a new comment begins.
1116	`rangeIslSize` should make it easier.
1136	I like the idea of verifying the correctness by reconstructing and comparing to the original. Maybe do it at the end to verify that the entire `TCInfoTy` is correct? On the other hand, an earlier fail would be better. What do you think?
1161–1164	This is a weird way to find out which indices map to what other index, I guess the equivalent of `isMatMulOperandAcc`. It requires that the dimension number is part of the AccMap's range, and it will fail if there is any expression (eg `Stmt[i][j] -> A[i-1][j+1]` matches the wrong dimensions), but at least there is the verification afterwards. I am not sure I like this sort of cleverness; I'd rather expect some sort of introspection into the map's coefficients, but I also think this should work in nearly all relevant cases and should be safe due to the verification, so let's keep it. However, please document this better, eg. add an example of what is expected to happen.
1171–1172	The "plain" functions are unfortunately not very robust, eg their result is different depending on the internal representation. I'd suggest `getConstant` (from ISLTools), but it only takes a pw_aff. Could you extract uses of `plain_get_val_if_fixed` into such a function, and mark it as TODO to cope with it later?
1219	I can use JScopImport to set a scalar memory access to a partial write without adding additional dependencies; that is, I don't think this can be just ignored. I suggest to have a single function that calls `getAccessesInOrder` and sort out which MemAccess is read/write in there, then analyze them.
1250–1252	Introduce a `is_superset` (etc) call?
1270–1272	Could we add utility functions such that this becomes `unite(J, P) == IndexSet`?
1284	Test in `containsOnlyTCAcc` is exactly what I was looking for. A region statement could look like this: c = C[i][j]; if (/non-affine condition/) { (void)A[i][k] + B[k][j]; } else { C[i][j] = c; } which has the correct order of accesses but is obviously not what we are looking for.
1374
1395	This seems to determine whether there is a reduction (contraction) carried by loop number `Pos`. The function name could be more meaningful. Suggestion: `isDepCarryingReductionOverDim` (not nice, but "TcDep" can mean anything)
1396	Consider passing by const-reference.
1399	`plain_get_val_if_fixed` is not really robust as it depends on the internal representation that can be different after eg simplify. Since this just checks to a specific value, the best would be to create a new set where all the fixed dimensions are that value (here: 1), and check whether `DepDelta` is a subset of it.
1410–1412	Why not `BoundDeltas.subtract()` instead of deltas?
1451–1454	Should we also check whether WAW, RAW dependences are incompatible?
1460–1461	lexmin/lexmax can be expensive. Wrap into a `IslMaxOperationsGuard`?
1462	You seem to assume a functional relationship from here on. If that's the case, you can keep the type a `pw_multi_aff`, which supports more functions that you may have missed, such as `pw_multi_aff::add`
1464	Consider using `reverse(rangeIslSize(0,DeltasDimNum))` (from ISLTools.h).
1465–1470	This is going to check whether each element out of `Intersection` is a contraction over dimension `i`. Don't we also need to check that every iteration out of the band `i` is contributing to that contraction?
1574	This is not part of the pattern detection, but the optimization. Could we move it to the patch that does the actual optimization?
1586–1587	Could you describe here what those 4 accesses are?
1592	Could you add a high-level description how the algorithm actually works? I.e. dependencies used to determine contraction dimensions, etc.
1607–1610	This condition is effectively identical to the next
1613	This constraint should not be intrinsic to the algorithm, but I agree it to be easier to handle for now.
1616–1617	[style] No Almost-Always-Auto in LLVM's coding style.
1622	This looks for the outermost node that is not a filter or band. Is it possible that while that outermost node is not a TC contraction, one of the inner ones might? What if the outermost node is a filter, looks like it would just `return false` in this case.
polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm.ll
13
polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm_2.ll
1–3	Since this is not FileCheck-ing the LLVM-IR output, suppress it with `-disable-output`
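The comments on lines 1250–1252 and 1270–1272 ask for utility functions so that a check can read `unite(J, P) == IndexSet`. A minimal sketch of what such value-returning helpers could look like is given below. The alias `IndexSetTy` and the use of `std::set<int>` are assumptions for illustration; the patch may use different container types.

```cpp
#include <algorithm>
#include <cassert>
#include <set>

using IndexSetTy = std::set<int>;

// Returns a new set holding every element of LHS or RHS.
// Neither input is modified, unlike the in-place helpers in SetOperations.h.
IndexSetTy unite(const IndexSetTy &LHS, const IndexSetTy &RHS) {
  IndexSetTy Result = LHS;
  Result.insert(RHS.begin(), RHS.end());
  return Result;
}

// Returns true if every element of Sub is also contained in Super.
// std::includes requires sorted ranges, which std::set guarantees.
bool isSuperset(const IndexSetTy &Super, const IndexSetTy &Sub) {
  return std::includes(Super.begin(), Super.end(), Sub.begin(), Sub.end());
}
```

With helpers of this shape, a property check becomes a single comparison, for example `if (unite(J, P) != IndexSet) return false;`.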

Herald added a project: Restricted Project. · Mar 28 2022, 6:37 PM

Thank you very much for the review! I am sorry for the late response. I will try to address all your comments within the next few weeks.

In D114336#3470843, @gareevroman wrote:

Thank you very much for the review! I am sorry for the late response. I will try to address all your comments within the next few weeks.

No worries, I wasn't responsive either :-(

gareevroman updated this revision to Diff 428082. · May 9 2022, 7:44 AM

Herald added subscribers: ormris, steven_wu, hiraditya. · May 9 2022, 7:44 AM

Harbormaster completed remote builds in B163485: Diff 428082. · May 9 2022, 7:55 AM

The following is successfully detected as tensor contraction. Is this intended?

void foo(double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {

for (int i = 0; i < 1024; i++)
    for (int j = 0; j < 1024; j++)
      for (int l = 0; l < 64; ++l)
         if (l != 0)
           for (int w = 0; w < 64; ++w)
             C[i][j] += A[i][l][w] * B[w][j][l];

}

Yes, it was intended. The transformation helps to optimize a class of programs that is broader than tensor contractions. However, it heavily depends on the codegen part. I think that improving the detection can be a goal of future work.

It might be if the codegen part is able to exclude the element 0. In contrast, this one is rejected:

In this case, the codegen excludes the element 0 for i2. I added a test case for this.

domain: "{ Stmt4[i0, i1, i2, i3] : 0 <= i0 <= 1023 and 0 <= i1 <= 1023 and 0 < i2 <= 63 and 0 <= i3 <= 63 }"
…

In contrast, this one is rejected:

void foo(int n, double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {

for (int i = 0; i < 1024; i++)
    for (int j = 0; j < 1024; j++)
      for (int l = 0; l < 64; l++)
        for (int w = 0; w < 64; ++w)
           if (w != 0)
             C[i][j] += A[i][l][w] * B[w][j][l];

}

or this:

void foo(int n, double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {
for (int i = 0; i < 1024; i++)
    for (int j = 0; j < 1024; j++)
      for (int l = 0; l < 64; l+=2)
        for (int w = 0; w < 64; ++w)
             C[i][j] += A[i][l][w] * B[w][j][l];
}

As far as I know, in these cases, the codegen modifies some memory accesses. Consequently, they do not correspond to the current pattern.

ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef0[i0, i2, 1 + i3] };
ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef1[1 + i3, i1, i2] };
ReadAccess :=	[Reduction Type: +] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef2[i0, i1] };
MustWriteAccess :=	[Reduction Type: +] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef2[i0, i1] };

I do get an assertion failure with this one:

void foo(double C[64][64], double A[64][64][64], double B[64][64][64]) {

for (int i = 0; i < 32; i++)
    for (int j = 0; j < 32; j++)
      for (int l = 0; l < 32; l++)
        for (int w = 0; w < 32; ++w)
             C[i][j] += A[i][l][w] * B[w][j][i+3];

}

I fixed the isTCOperandAcc function and checked that all other asserts are used properly.

Here, i occurs as an index for A, B, and C, and it is detected as TC. Is this supported?

void foo(double C[64][64], double A[64][64][64], double B[64][64][64]) {
for (int i = 0; i < 32; i++)
for (int j = 0; j < 32; j++)
for (int l = 0; l < 32; l++)
for (int w = 0; w < 32; w++)
    C[i][j] += A[i][l][w] * B[w][j][i];
}

For some reason, I cannot reproduce that. I have added a corresponding test case. As far as I understand, that should be detected because of the line 1365.

1347 static bool containsOnlyTCAcc(isl::set Domain, isl::map PartialSchedule,
                                   TCInfoTy &TCI) {
...
1365   if (intersect(IandJIndexSet, TCI.P).size() != 0)
1366     return false;
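To make the role of the index sets concrete, here is a simplified sketch of how loop indices could be split into the bundles I, J, and P from the subscripts of C, A, and B. It is an illustration under stated assumptions, not the patch's `containsOnlyTCAcc`: the names `BundlesTy` and `classifyIndices` are hypothetical, loops are identified by plain integers, and the rule used is the form C(shuffle(I, J)), A(shuffle(I, P)), B(shuffle(P, J)) mentioned elsewhere in this review.

```cpp
#include <cassert>
#include <set>

using IndexSetTy = std::set<int>;

struct BundlesTy {
  IndexSetTy I; // indices used by C and A, but not B
  IndexSetTy J; // indices used by C and B, but not A
  IndexSetTy P; // contracted indices: used by A and B, but not C
};

// Classifies every loop index that occurs in a subscript of C, A, or B.
// Returns false if some index fits none of the three classes, for example an
// index used by all three arrays or by only one of them.
bool classifyIndices(const IndexSetTy &CIdx, const IndexSetTy &AIdx,
                     const IndexSetTy &BIdx, BundlesTy &Out) {
  IndexSetTy All = CIdx;
  All.insert(AIdx.begin(), AIdx.end());
  All.insert(BIdx.begin(), BIdx.end());

  BundlesTy Result;
  for (int Idx : All) {
    const bool InC = CIdx.count(Idx) != 0;
    const bool InA = AIdx.count(Idx) != 0;
    const bool InB = BIdx.count(Idx) != 0;
    if (InC && InA && !InB)
      Result.I.insert(Idx);
    else if (InC && InB && !InA)
      Result.J.insert(Idx);
    else if (InA && InB && !InC)
      Result.P.insert(Idx);
    else
      return false;
  }
  Out = Result;
  return true;
}
```

With the loops numbered i = 0, j = 1, l = 2, w = 3, the accesses `C[i][j]`, `A[i][l][w]`, `B[w][j][l]` classify cleanly, whereas `B[w][j][i]` makes `i` occur in all three arrays and leaves `l` in A only, so this sketch rejects it.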

I think that it is redundant to require that bands are marked as permutable, since we check the form of dependencies and memory accesses. I propose to remove such checks for pattern matching optimizations.

polly/lib/Transform/MatmulOptimizer.cpp
1116	I think rangeIslSize can’t be used in this case. However, I’ve tried to use rangeIslSize to improve the patch.
1136	Other parts of TCInfoTy are verified in isTCOperandAcc too. I think that it would be better to verify the related information in one place as much and as early as possible. Probably, the earlier fail would simplify the debugging, since we exactly know the form of memory accesses and can rely on it. Additionally, the performance can be improved, since the earlier fail helps to avoid additional operations with sets.
1219	I am not sure whether modifications of implementations of tensor contractions, which contain read and write scalar memory accesses, are useful in practice. Moreover, since bundles of induction variables I, J, P can contain an unlimited number of dimensions, we possibly cannot follow the algorithm from the containsOnlyMatrMultAcc function, which permutes dimensions and checks that additional memory accesses have stride 0 in terms of dimensions MMM.i, MMM.j, and MMM.k. Consequently, such memory accesses can be treated as scalar memory accesses. I have not come up with an effective alternative yet. That is why I do not consider scalar memory accesses in the getWriteAccess and setReadAccesses functions. Could we mark it as TODO and do it in the future?
1284	Thanks for the example! I have added a corresponding test case. If I am not mistaken, it requires DeLICM.
1395	Could we name it isReductionCarriedOverDim? I think, in this case, we should rename the parameter Pos to Dim to make it more readable.
1410–1412	As far as I understand, these operations are not equivalent. deltas computes a set containing the differences between image elements and the corresponding domain elements of the input, whereas subtract computes a set difference. For example, for the following sets they compute the following:

BoundDeltas: {Stmt_for_body15[31, 31, 31, 31, 31, 31]}
isl::manage(isl_set_neg(DepDelta.copy())): {Stmt_for_body15[0, 0, 0, 0, 0, -1]}
BoundDeltas.subtract(isl::manage(isl_set_neg(DepDelta.copy()))): {Stmt_for_body15[31, 31, 31, 31, 31, 31]}
deltas: {Stmt_for_body15[31, 31, 31, 31, 31, 32]}

BoundDeltas: {Stmt_for_body15[31, 31, 31, 31, 31, 31]}
isl::manage(isl_set_neg(DepDelta.copy())): {Stmt_for_body15[0, 0, 0, -1, 0, 31]}
BoundDeltas.subtract(isl::manage(isl_set_neg(DepDelta.copy()))): {Stmt_for_body15[31, 31, 31, 31, 31, 31]}
deltas: {Stmt_for_body15[31, 31, 31, 32, 31, 0]}

This comment interferes with the comment about pw_multi_aff. Consequently, I replaced the usage of isl_map_deltas with operations on pw_multi_aff.
1451–1454	As far as I understand, that is not necessary, because subsequently we check that the statement has the form C(shuffle(I, J)) = E(A(shuffle(I, P)), B(shuffle(P, J)), C(shuffle(I, J))), where E is an expression that contains reads from the tensors A, B, C, and an arbitrary number of reads from constants with respect to bundles I, J, and P. I have added a comment that describes this: "The form of anti and output dependencies is specified by the form of the SCoP statement, which is checked by subsequent analysis."
1460–1461	What is the maximal amount of computational steps we should use by default? I set it to 500000 according to DependenceInfo.cpp.
1465–1470	Could you clarify what you mean by the band i? Are these the indexes ki, which describe the dependencies? isTcDep checks that the dependency has the form /// S(..., ki, max(k(i + 1)), ..., max(kn), ...) -> S(..., ki + 1, min(k(i + 1)), ..., min(kn), …)
1602	Could we factor out this condition into ScheduleTreeOptimizer::isPMOptimizableBandNode, since it is common to the isTCPattern and isMatrMultPattern functions? A new version of the patch shows what it could look like.
1613	Could we add a TODO comment for this?
1622	If I am not mistaken, this only checks that all band nodes, which represent the statement, are not split by filter nodes. This accepts a straightforward implementation of TC with/without delicm. For example,

domain: "{ Stmt_for_body8[i0, i1, i2] : 0 <= i0 <= 1599 and 0 <= i1 <= 1799 and 0 <= i2 <= 2199; Stmt_for_body3[i0, i1] : 0 <= i0 <= 1599 and 0 <= i1 <= 1799; Stmt_for_body3_last[i0, i1] : 0 <= i0 <= 1599 and 0 <= i1 <= 1799 }"
child:
  sequence:
  - filter: "{ Stmt_for_body3[i0, i1] }"
    child:
      schedule: "[{ Stmt_for_body3[i0, i1] -> [(i0)] }, { Stmt_for_body3[i0, i1] -> [(i1)] }]"
      permutable: 1
      coincident: [ 1, 1 ]
  - filter: "{ Stmt_for_body3_last[i0, i1] }"
    child:
      schedule: "[{ Stmt_for_body3_last[i0, i1] -> [(i0)] }, { Stmt_for_body3_last[i0, i1] -> [(i1)] }]"
      permutable: 1
      coincident: [ 1, 1 ]
  - filter: "{ Stmt_for_body8[i0, i1, i2] }"
    child:
      schedule: "[{ Stmt_for_body8[i0, i1, i2] -> [(i0)] }, { Stmt_for_body8[i0, i1, i2] -> [(i1)] }, { Stmt_for_body8[i0, i1, i2] -> [(i2)] }]"
      permutable: 1
      coincident: [ 1, 1, 0 ]

domain: "{ Stmt2[i0, i1, i2] : 0 <= i0 <= 31 and 0 <= i1 <= 31 and 0 <= i2 <= 31 }"
child:
  schedule: "[{ Stmt2[i0, i1, i2] -> [(i0)] }, { Stmt2[i0, i1, i2] -> [(i1)] }, { Stmt2[i0, i1, i2] -> [(i2)] }]"
  permutable: 1
  coincident: [ 1, 1, 0 ]

Sorry, I have not committed an updated version of the optimization of TC to my github repo. However, I believe that, if this is the case, we can safely replace all such nodes.

+ auto NodeType = isl_schedule_node_get_type(Node.get());
+ while ((NodeType != isl_schedule_node_domain) &&
+        (NodeType != isl_schedule_node_filter)) {
+   assert((NodeType != isl_schedule_node_sequence) &&
+          L"Prevent the undefined behavior");
+   Node = Node.parent();
+   NodeType = isl_schedule_node_get_type(Node.get());
+ }
+ Node = Node.child(0);
+ Node = isl::manage(isl_schedule_node_cut(Node.release()));
+ return Node.insert_partial_schedule(Dimensions);

I think that the detection of more sophisticated implementations of TC is a possible goal of future research. I have described this in the comment.
polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm_2.ll
1–3	Could we fix the existing test cases in a separate patch?

polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm_2.ll
; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
; REQUIRES: asserts

polly/test/ScheduleOptimizer/pattern-matching-based-opts_16.ll
; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s

polly/test/ScheduleOptimizer/pattern-matching-based-opts_17.ll
; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s

polly/test/ScheduleOptimizer/pattern-matching-based-opts_18.ll
; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s

polly/test/ScheduleOptimizer/pattern-matching-based-opts_19.ll
; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s

polly/test/ScheduleOptimizer/pattern-matching-based-opts_20.ll
; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
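The dependence form quoted in the reply on lines 1465–1470, S(..., ki, max(k(i + 1)), ..., max(kn), ...) -> S(..., ki + 1, min(k(i + 1)), ..., min(kn), ...), can be illustrated on plain integer vectors. The sketch below is an assumption-laden simplification and not the patch's isl-based check: it takes one dependence delta (target iteration minus source iteration), per-loop extents, and a hypothetical flag vector marking the contracted loops, all of which are names introduced here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Checks whether Delta has the shape of a reduction dependence carried by
// loop Dim: +1 in Dim, a wrap from the maximum back to the minimum
// (-(Extent - 1)) in every contracted loop nested inside Dim, and 0 in all
// other loops.
bool isReductionCarriedOverDim(const std::vector<long> &Delta,
                               const std::vector<long> &Extent,
                               const std::vector<bool> &IsContracted,
                               std::size_t Dim) {
  assert(Delta.size() == Extent.size() && Delta.size() == IsContracted.size());
  assert(Dim < Delta.size() && IsContracted[Dim]);

  for (std::size_t D = 0; D < Delta.size(); ++D) {
    long Expected = 0; // loops not involved keep their value
    if (D == Dim)
      Expected = 1; // ki -> ki + 1
    else if (D > Dim && IsContracted[D])
      Expected = -(Extent[D] - 1); // max(k) -> min(k)
    if (Delta[D] != Expected)
      return false;
  }
  return true;
}
```

For six loops of extent 32 with loops 3 and 5 contracted, the delta [0, 0, 0, 1, 0, -31] is accepted for `Dim` = 3 and the delta [0, 0, 0, 0, 0, 1] for `Dim` = 5. These are the negations of the two `isl_set_neg(DepDelta)` values quoted in the reply on lines 1410–1412.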

ormris removed a subscriber: ormris. · May 16 2022, 10:58 AM

In D114336#3500858, @gareevroman wrote:

1
Yes, it was intended. The transformation helps to optimize a class of programs that is broader than tensor contractions. However, it heavily depends on the codegen part. I think that improving the detection can be a goal of future work.

Please document what pattern is intended to be recognized. I don't think the doc for isTCPattern is sufficient; it only mentions what is checked. Documenting the intended pattern would help identify whether a check has been forgotten, e.g. for the statement domain.

As far as I know, in these cases, the codegen modifies some memory accesses. Consequently, they do not correspond to the current pattern.

ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef0[i0, i2, 1 + i3] };
ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef1[1 + i3, i1, i2] };
ReadAccess :=	[Reduction Type: +] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef2[i0, i1] };
MustWriteAccess :=	[Reduction Type: +] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef2[i0, i1] };

What do you mean by "codegen modifies some memory accesses"? Polly's Codegen? What is the check to exclude this? Is it the 1 + i3 memory access expression? Where does it come from?

4

Here, i occurs as an index for A, B, and C, and it is detected as TC. Is this supported?
void foo(double C[64][64], double A[64][64][64], double B[64][64][64]) {
for (int i = 0; i < 32; i++)
for (int j = 0; j < 32; j++)
for (int l = 0; l < 32; l++)
for (int w = 0; w < 32; w++)
    C[i][j] += A[i][l][w] * B[w][j][i];
}
For some reason, I cannot reproduce that. I have added a corresponding test case. As far as I understand, that should be detected because of the line 1365.

Should or should not be detected?

While debugging, it is now rejected because of this:

if (!TCI.B) {
  // IndexSet should be a union of J and P sets.
  if (unite(TCI.P, TCI.J) != IndexSet)
    return false;

Could you choose more meaningful identifiers than I, J, P and J, or use them in the pattern described in isTCPattern? I think of something like:

for (...) {
  ...
  for (...) {
    if (c) {
      auto acc = C[P]; // ReadFromC
      auto a = A[Pa, I];
      auto b = B[Pb, J];
      auto arg = f(a, b);
      acc = acc op arg;
      C[P] = acc; // WriteToC
    }
  }
}

where P, I, J are sets of indices of the surrounding loops, Pa and Pb are subsets of P, I are the indices only occurring in the subscript of reading from A, J are the indices only occurring in the subscript for reading from B. There must be no indices not occurring in either P, I or J. `op=` is a commutative operation ...., c is an affine condition usually just `true`. `f` is a side-effect free operation.

(I don't know whether this is correct; I want to understand whether the checked conditions are sufficient.)

5

I think that it is redundant to require that bands are marked as permutable, since we check the form of dependencies and memory accesses. I propose to remove such checks for pattern matching optimizations.

Ok.

polly/lib/Transform/MatmulOptimizer.cpp
204	`@{` is not needed when documenting just a single member.
1246	The assignments should just make a copy of the array. With `Dimensions` being passed by-value, the caller has to make the copy, which it should not need to. `SmallVector` has no overload for being assigned an `ArrayRef`, but you could use `llvm::append_range` to insert all the values.
1291	Compiler warning: /home/meinersbur/src/llvm-project/polly/lib/Transform/MatmulOptimizer.cpp:1310:13: warning: moving a temporary object prevents copy elision [-Wpessimizing-move] TCI.I = std::move(set_difference(IndexSet, TCI.P)); The result of `set_difference` is already an r-value, no need to cast it to an r-value.
1296	Same compiler warning.
1395	Sounds ok.
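The `-Wpessimizing-move` warning reported for lines 1291 and 1296 can be reproduced in miniature. The sketch below uses hypothetical stand-ins (`setDifference`, `ExampleInfoTy`, `fillI`) rather than the patch's own helper and `TCInfoTy`, and it assumes `std::set<int>` as the set type.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>

using IndexSetTy = std::set<int>;

// Hypothetical stand-in for the patch's set_difference helper: returns the
// elements of LHS that are not in RHS as a new set.
IndexSetTy setDifference(const IndexSetTy &LHS, const IndexSetTy &RHS) {
  IndexSetTy Result;
  std::set_difference(LHS.begin(), LHS.end(), RHS.begin(), RHS.end(),
                      std::inserter(Result, Result.end()));
  return Result; // returned by value; no std::move needed here either
}

// Hypothetical stand-in for the structure that stores the index bundles.
struct ExampleInfoTy {
  IndexSetTy I;
};

void fillI(ExampleInfoTy &Info, const IndexSetTy &IndexSet,
           const IndexSetTy &P) {
  // The call expression is already an r-value, so move assignment is
  // selected without wrapping it in std::move.
  Info.I = setDifference(IndexSet, P);
}
```

Writing `Info.I = std::move(setDifference(IndexSet, P));` instead computes the same set but triggers the warning quoted above.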

gareevroman updated this revision to Diff 436206. · Jun 12 2022, 3:58 AM

In D114336#3517540, @Meinersbur wrote:

In D114336#3500858, @gareevroman wrote:

1
Yes, it was intended. The transformation helps to optimize a class of programs that is broader than tensor contractions. However, it heavily depends on the codegen part. I think that improving the detection can be a goal of future work.

Please document what pattern is intended to be recognized. I don't think the doc for isTCPattern is sufficient; it only mentions what is checked. Documenting the intended pattern would help identify whether a check has been forgotten, e.g. for the statement domain.

I've added a description of the TC-like kernel, which is the intended pattern, to the doc for isTCPattern function. I've added additional remarks according to your comments and restrictions of the current implementation.

As far as I know, in these cases, the codegen modifies some memory accesses. Consequently, they do not correspond to the current pattern.
ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef0[i0, i2, 1 + i3] };
ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef1[1 + i3, i1, i2] };
ReadAccess :=	[Reduction Type: +] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef2[i0, i1] };
MustWriteAccess :=	[Reduction Type: +] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef2[i0, i1] };
What do you mean by "codegen modifies some memory accesses"? Polly's Codegen?

Sorry, I meant ScopBuilder. In the following case

void foo(double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {
for (int i = 0; i < 1024; i++)
    for (int j = 0; j < 1024; j++)
      for (int l = 0; l < 64; ++l)
         if (l != 0)
           for (int w = 0; w < 64; ++w)
             C[i][j] += A[i][l][w] * B[w][j][l];
}

ScopBuilder generates the following memory accesses, which correspond to the pattern:

{ Stmt4[i0, i1, i2, i3] -> MemRef0[o0, o1] : o0 = i0 and o1 = i1 }
{ Stmt4[i0, i1, i2, i3] -> MemRef3[o0, o1, o2] : o0 = i3 and o1 = i1 and o2 = i2 }
{ Stmt4[i0, i1, i2, i3] -> MemRef2[o0, o1, o2] : o0 = i0 and o1 = i2 and o2 = i3 }
{ Stmt4[i0, i1, i2, i3] -> MemRef0[o0, o1] : o0 = i0 and o1 = i1 }

If we change that code a bit

void foo(int n, double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {
for (int i = 0; i < 1024; i++)
    for (int j = 0; j < 1024; j++)
      for (int l = 0; l < 64; l++)
        for (int w = 0; w < 64; ++w)
           if (w != 0)
             C[i][j] += A[i][l][w] * B[w][j][l];
}

ScopBuilder generates the following memory accesses:

{ Stmt3[i0, i1, i2, i3] -> MemRef2[o0, o1] : o0 = i0 and o1 = i1 }
{ Stmt3[i0, i1, i2, i3] -> MemRef2[o0, o1] : o0 = i0 and o1 = i1 }
{ Stmt3[i0, i1, i2, i3] -> MemRef1[o0, o1, o2] : o0 = 1 + i3 and o1 = i1 and o2 = i2 }
{ Stmt3[i0, i1, i2, i3] -> MemRef0[o0, o1, o2] : o0 = i0 and o1 = i2 and o2 = 1 + i3 }

In the context of the previous discussion, I meant that memory accesses are modified in comparison to the previous considered case.

What is the check to exclude this?

They will be rejected by isTCOperandAcc. If we fix the output dimensions, values of output dimensions will not form a permutation of a subset of values of input dimensions. Please see comments inside this function.

Is the 1 + i3 memory access expression?

Yes, I believe so.

Where does it come from?

If we look at the domain for the i3 variable, we see that the value 0 from the domain of w-loop is excluded and the loop bounds are modified to start from 0. Memory accesses correspond to this.

domain: "{ Stmt3[i0, i1, i2, i3] : 0 <= i0 <= 1023 and 0 <= i1 <= 1023 and 0 <= i2 <= 63 and 0 <= i3 <= 62 }"
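The rejection described above can be illustrated with a simplified model. This is a sketch under stated assumptions only: the `OutDim` struct and `isPermutationOfSubset` function are hypothetical and do not reproduce the isl-based logic of `isTCOperandAcc`. Each output dimension is modeled as one input dimension plus a constant offset, so `MemRef3[i3, i1, i2]` is a permutation of a subset of the input dimensions, while `MemRef1[1 + i3, i1, i2]` is not.

```cpp
#include <cassert>
#include <vector>

// Simplified model of one output dimension of an access relation:
// o = i<InputDim> + Offset. Illustration only, not Polly's representation.
struct OutDim {
  int InputDim;
  int Offset;
};

// Returns true if the output dimensions form a permutation of a subset of
// the input dimensions: each one equals a distinct, unshifted input dimension.
bool isPermutationOfSubset(const std::vector<OutDim> &Out, int NumInputDims) {
  assert(NumInputDims >= 0);
  std::vector<bool> Used(NumInputDims, false);
  for (const OutDim &D : Out) {
    if (D.InputDim < 0 || D.InputDim >= NumInputDims)
      return false;
    if (D.Offset != 0 || Used[D.InputDim])
      return false;
    Used[D.InputDim] = true;
  }
  return true;
}
```

In this model, the access `{ {3, 0}, {1, 0}, {2, 0} }` with four input dimensions is accepted, and replacing the first entry by `{3, 1}` (the `1 + i3` case) makes the check fail.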

4

Here, i occurs as an index for A, B, and C, and the kernel is detected as a TC. Is this supported?
void foo(double C[64][64], double A[64][64][64], double B[64][64][64]) {
for (int i = 0; i < 32; i++)
for (int j = 0; j < 32; j++)
for (int l = 0; l < 32; l++)
for (int w = 0; w < 32; w++)
    C[i][j] += A[i][l][w] * B[w][j][i];
}
For some reason, I cannot reproduce that. I have added a corresponding test case. As far as I understand, that should be detected because of the line 1365.
Should or should not be detected?

That should not be detected, because the intersection of free and contracted indices should always be empty. We check this at the line "if (intersect(IandJIndexSet, TCI.P).size() != 0)".

While debugging, it is now rejected because of this:
``
if (!TCI.B) {
  // IndexSet should be a union of J and P sets.
  if (unite(TCI.P, TCI.J) != IndexSet)
    return false;
``

You’re right. Thank you. "if (intersect(IandJIndexSet, TCI.P).size() != 0)" doesn’t help in this case. In that example, we have dependencies of the form:

{ Stmt3[i0, i1, i2, i3] -> Stmt3[o0, o1, o2, o3] : (o0 = i0 and o1 = i1 and o2 = i2 and o3 = 1 + i3 and i0 >= 0 and i0 <= 31 and i1 >= 0 and i1 <= 31 and i2 >= 0 and i2 <= 31 and i3 >= 0 and i3 <= 30) or (i3 = 31 and o0 = i0 and o1 = i1 and o2 = 1 + i2 and o3 = 0 and i0 >= 0 and i0 <= 31 and i1 >= 0 and i1 <= 31 and i2 >= 0 and i2 <= 30) }

the isl ast has the form:

{ domain: "{ Stmt3[i0, i1, i2, i3] : 0 <= i0 <= 31 and 0 <= i1 <= 31 and 0 <= i2 <= 31 and 0 <= i3 <= 31 }", child: { mark: "Loop with Metadata", child: { schedule: "[{ Stmt3[i0, i1, i2, i3] -> [(i0)] }]", child: { mark: "Loop with Metadata", child: { schedule: "[{ Stmt3[i0, i1, i2, i3] -> [(i1)] }]", child: { mark: "Loop with Metadata", child: { schedule: "[{ Stmt3[i0, i1, i2, i3] -> [(i2)] }]", child: { mark: "Loop with Metadata", child: { schedule: "[{ Stmt3[i0, i1, i2, i3] -> [(i3)] }]" } } } } } } } } }
if (1 && (&MemRef5[31][31][32] <= &MemRef0[0][0] || &MemRef0[31][32] <= &MemRef5[0][0][0]) && (&MemRef4[31][31][32] <= &MemRef0[0][0] || &MemRef0[31][32] <= &MemRef4[0][0][0]))

    // Loop with Metadata
    for (int c0 = 0; c0 <= 31; c0 += 1) {
      // Loop with Metadata
      for (int c1 = 0; c1 <= 31; c1 += 1) {
        // Loop with Metadata
        for (int c2 = 0; c2 <= 31; c2 += 1) {
          // Loop with Metadata
          for (int c3 = 0; c3 <= 31; c3 += 1)
            Stmt3(c0, c1, c2, c3);
        }
      }
    }

else
    {  /* original code */ }

Consequently, only "l" and "w" are treated as "contracted indices", which are stored in TCI.P. Sorry, I missed that.

If indexes of an operand of the tensor contraction don’t contain TCI.P, we don't accept the program. We check this in lines

…
if (!isSuperset(IndexSet, TCI.P))
      return false;
…

…
if (unite(TCI.P, TCI.J) != IndexSet)
      return false;
…

unite(TCI.P, TCI.J) still isn't equal to IndexSet in the considered case, because IndexSet doesn't contain the "l".

If we didn't have the "l-loop", unite(TCI.P, TCI.J) still would not be equal to IndexSet, which would contain i, j and w, because TCI.I would contain only i and TCI.J would contain only j.
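The set relations used in this argument can be checked with ordinary `std::set` operations. This is a minimal sketch, assuming a hypothetical `isOperandAccepted` function that combines the two quoted conditions; the loop indices i, j, l, w are numbered 0 to 3, and the helper names only mirror those quoted above, they are not the implementations from the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <set>

using IndexSetTy = std::set<int>;

// Returns the union of two index sets.
IndexSetTy unite(const IndexSetTy &A, const IndexSetTy &B) {
  IndexSetTy R = A;
  R.insert(B.begin(), B.end());
  return R;
}

// Returns true if Super contains every element of Sub.
bool isSuperset(const IndexSetTy &Super, const IndexSetTy &Sub) {
  return std::includes(Super.begin(), Super.end(), Sub.begin(), Sub.end());
}

// Simplified operand check (an assumption, not the code of the patch): the
// operand must use all contracted indices P and, besides them, exactly the
// free indices in Free.
bool isOperandAccepted(const IndexSetTy &IndexSet, const IndexSetTy &P,
                       const IndexSetTy &Free) {
  if (!isSuperset(IndexSet, P))
    return false;
  return unite(P, Free) == IndexSet;
}

// The case from the discussion: i = 0, j = 1, l = 2, w = 3 and P = {l, w}.
void exampleFromReview() {
  IndexSetTy P = {2, 3};
  assert(isOperandAccepted({0, 2, 3}, P, {0}));    // A[i][l][w], I = {i}
  assert(!isOperandAccepted({0, 1, 3}, P, {1}));   // B[w][j][i], J = {j}
  assert(!isOperandAccepted({0, 1, 3}, {3}, {1})); // same without the l-loop
}
```

Under these assumptions, `B[w][j][i]` fails both with and without the l-loop, which matches the two cases described above.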

test/ScheduleOptimizer/pattern-matching-based-opts_21.ll is the corresponding test case. Additionally, I've added test/ScheduleOptimizer/pattern-matching-based-opts_25.ll.

P.S.: I had to use the -fno-unroll-loops option of clang. Otherwise, one of the loops is optimized out on my machine.

Could you choose more meaningful identifiers than I, J, P and J, or use them in the pattern described in isTCPattern? I think of something like:

I’ve added a description of bundles I, J, and P to the description of the pattern. I hope it clarifies their purpose. I propose to apply the terminology that is used in the paper [1] and its predecessors (e.g., [2]), to simplify the understanding of the code for their readers.

[1] - Gareev R., Grosser T., Kruse M. High-Performance Generalized Tensor Operations: A Compiler-Oriented Approach // ACM Transactions on Architecture and Code Optimization (TACO). 2018. Vol. 15, no. 3. P. 34:1–34:27. DOI: 10.1145/3235029.
[2] - Matthews D. High-Performance Tensor Contraction without BLAS // SIAM Journal on Scientific Computing. 2018. Vol. 40, no. 1. P. C1–C24. DOI: 10.1137/16m108968x.

for (...) {
  ...
  for (...) {
    if (c) {
      auto acc = C[P]; // ReadFromC
      auto a = A[Pa, I];
      auto b = B[Pb,J];
      auto arg = f(a,b);
      acc = acc op arg;
      C[P] = acc; // WriteToC
    }
  }
}

where P, I, J are sets of indices of the surrounding loops, Pa and Pb are subsets of P, I are the indices only occurring in the subscript of reading from A, J are the indices only occurring in the subscript for reading from B. There must be no indices not occurring in either P, I or J.

I think the definition of the TC-like corresponds to the information about sets P, I, J. Probably, it’s redundant to introduce the terminology for subsets at this point. Could we do this in the description of the optimization, if it’d be needed?
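To make the roles of the free and contracted loops concrete, the kernel `C[i][j] += A[i][l][w] * B[w][j][l]` from the earlier example can be compared with a matrix multiplication in which the contracted loops l and w are merged into one index. This is a minimal sketch with small, arbitrarily chosen extents and flat row-major storage; the function names are hypothetical, and it only illustrates the logical view mentioned in the summary, not the detection or the optimization code.

```cpp
#include <cassert>
#include <vector>

// Small illustrative extents for the loops i, j, l, w.
constexpr int NI = 3, NJ = 4, NL = 2, NW = 5;

// Direct form: C[i][j] += A[i][l][w] * B[w][j][l], arrays stored row-major.
std::vector<double> contractDirect(const std::vector<double> &A,
                                   const std::vector<double> &B) {
  assert((int)A.size() == NI * NL * NW && (int)B.size() == NW * NJ * NL);
  std::vector<double> C(NI * NJ, 0.0);
  for (int i = 0; i < NI; i++)
    for (int j = 0; j < NJ; j++)
      for (int l = 0; l < NL; l++)
        for (int w = 0; w < NW; w++)
          C[i * NJ + j] +=
              A[(i * NL + l) * NW + w] * B[(w * NJ + j) * NL + l];
  return C;
}

// Matrix form: the contracted loops l and w are merged into k = l * NW + w.
std::vector<double> contractAsMatMul(const std::vector<double> &A,
                                     const std::vector<double> &B) {
  const int NK = NL * NW;
  std::vector<double> MatA(NI * NK), MatB(NK * NJ), C(NI * NJ, 0.0);
  for (int i = 0; i < NI; i++)
    for (int l = 0; l < NL; l++)
      for (int w = 0; w < NW; w++)
        MatA[i * NK + l * NW + w] = A[(i * NL + l) * NW + w];
  for (int w = 0; w < NW; w++)
    for (int j = 0; j < NJ; j++)
      for (int l = 0; l < NL; l++)
        MatB[(l * NW + w) * NJ + j] = B[(w * NJ + j) * NL + l];
  for (int i = 0; i < NI; i++)
    for (int j = 0; j < NJ; j++)
      for (int k = 0; k < NK; k++)
        C[i * NJ + j] += MatA[i * NK + k] * MatB[k * NJ + j];
  return C;
}

// Fills both operands with small integer values and compares the results.
bool contractionsAgree() {
  std::vector<double> A(NI * NL * NW), B(NW * NJ * NL);
  for (int x = 0; x < (int)A.size(); x++)
    A[x] = x % 7;
  for (int x = 0; x < (int)B.size(); x++)
    B[x] = x % 5 + 1;
  return contractDirect(A, B) == contractAsMatMul(A, B);
}
```

Here i and j play the role of free indices and (l, w) the role of contracted indices; the merged index k visits (l, w) in the same order as the original loops.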

op= is a commutative operation ....,

I think this is a redundant condition. However, you can find it in the paper [1]. I believe that, if we preserve the order of loops with indexes from the bundle P during the optimization, there would not be any violation.

c is an affine condition usually just true.

Could you elaborate on that? I’m not sure that I understand where such a condition is used.

f is a side-effect free operation.

If I’m not mistaken, according to, for example, ScopDetection.cpp, only side effect free functions calls can be located inside a Scop.

(I don't know whether this is correct, I want to understand whether the checked conditions are sufficient).

5

I think that it is redundant to require that bands are marked as permutable, since we check the form of dependencies and memory accesses. I propose to remove such checks for pattern matching optimizations.

Ok.

polly/lib/Transform/MatmulOptimizer.cpp
1246	I think we can use llvm::replace to avoid clearing the vector and preserve the logic.

Harbormaster completed remote builds in B169299: Diff 436206. Jun 12 2022, 4:14 AM

In D114336#3576199, @gareevroman wrote:
void foo(int n, double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {
for (int i = 0; i < 1024; i++)
    for (int j = 0; j < 1024; j++)
      for (int l = 0; l < 64; l++)
        for (int w = 0; w < 64; ++w)
           if (w != 0)
             C[i][j] += A[i][l][w] * B[w][j][l];
}
ScopBuilder generates the following memory accesses:
{ Stmt3[i0, i1, i2, i3] -> MemRef2[o0, o1] : o0 = i0 and o1 = i1 }
{ Stmt3[i0, i1, i2, i3] -> MemRef2[o0, o1] : o0 = i0 and o1 = i1 }
{ Stmt3[i0, i1, i2, i3] -> MemRef1[o0, o1, o2] : o0 = 1 + i3 and o1 = i1 and o2 = i2 }
{ Stmt3[i0, i1, i2, i3] -> MemRef0[o0, o1, o2] : o0 = i0 and o1 = i2 and o2 = 1 + i3 }
In the context of the previous discussion, I meant that memory accesses are modified in comparison to the previous considered case.

Where does it come from?

If we look at the domain for the i3 variable, we see that the value 0 from the domain of w-loop is excluded and the loop bounds are modified to start from 0. Memory accesses correspond to this.

Looks like some other optimization (maybe JumpThreading?) modifies the loop range. Ideally, the detection would be robust enough to not depend on whether the domain space has an offset.

Could you choose more meaningful identifiers than I, J, P and J, or use them in the pattern described in isTCPattern? I think of something like:

I’ve added a description of bundles I, J, and P to the description of the pattern. I hope it clarifies their purpose. I propose to apply the terminology that is used in the paper [1] and its predecessors (e.g., [2]), to simplify the understanding of the code for their readers.

In a paper the names must be shorter to fit on the page and are described close to the figure. It also should not be necessary to have access to the paper to understand the algorithm. For narrowly scoped variables such as loop counters single letters might be ok because the definition is likely on the same screen (or for a paper: the same page), but globals should be more identifiable. Paper [1] also does not make the connection between the symbols it uses and what they correspond to in Polly's data structures.

However, I don't request such a change atm.

for (...) {
  ...
  for (...) {
    if (c) {
      auto acc = C[P]; // ReadFromC
      auto a = A[Pa, I];
      auto b = B[Pb,J];
      auto arg = f(a,b);
      acc = acc op arg;
      C[P] = acc; // WriteToC
    }
  }
}

where P, I, J are sets of indices of the surrounding loops, Pa and Pb are subsets of P, I are the indices only occurring in the subscript of reading from A, J are the indices only occurring in the subscript for reading from B. There must be no indices not occurring in either P, I or J.

I think the definition of the TC-like corresponds to the information about sets P, I, J. Probably, it’s redundant to introduce the terminology for subsets at this point. Could we do this in the description of the optimization, if it’d be needed?

It is also 'redundant' with the code itself, but it helps understanding it.

The description of I and P are "Input dimensions of the schedule space, which represent free indices of tensors." One has to know what "free indices" are, indices of what, etc.

The "definition" of isTCPattern is declarative, does not even explain what all those symbols are, and refers to our matmul paper which does not apply because it only is a special case of the TC pattern.

op= is a commutative operation ....,

I think this is a redundant condition. However, you can find it in the paper [1]. I believe that, if we preserve the order of loops with indexes from the bundle P during the optimization, there would not be any violation.

This is exactly the sort of thing I would want to clarify.

c is an affine condition usually just true.

Could you elaborate on that? I’m not sure that I understand where such a condition is used.

It corresponds to a filter node in the TC body. You have used the if (w != 0) to illustrate where the access function deviates from the usual pattern, which implied that it would be something you would like to support. If not, that would not be part of the pattern.

f is a side-effect free operation.

If I’m not mistaken, according to, for example, ScopDetection.cpp, only side effect free functions calls can be located inside a Scop.

f is not necessarily a function call, but as mentioned an "operation" representing the calculation done in the TC body.

Side-effect here means something different than ScopDetection. A write to an unrelated array D would be a side-effect for the TC, but accurately represented by a Scop.
We could allow unknown side-effects in polly in the future with a general "memory" dependency, an extension to what Polly already does with -polly-allow-modref-calls. These could just not be reordered relative to each other.

I really think the documentation should be better. I had a hard time fixing bugs in the matmul optimization just with understanding what the code is supposed to be doing after a long time and would prefer to not repeat that again. See rGcad9f98a2ad98fecf663e9ce39502b8e43676fc9 and rGa56bd7dec8da4348d847d53c96d8a30f4a821d36.

polly/lib/Support/ISLTools.cpp
266 ↗	(On Diff #436206)	nice
polly/lib/Transform/MatmulOptimizer.cpp
1219	The concern is that I can modify what `isLatestArrayKind()` returns by simply importing a JScop. The `continue` just ignores such weirdness but I think it is safer to fail in this case. You yourself mention that scalar accesses are likely not useful, so why not fail when one is found instead (`return null` instead of `continue`)? Some exceptions may be possible, such as read-only scalars (`VirtualUse::ReadOnly`, `VirtualUse::Synthesizable`)
1465–1470	There is a check `Intersection.is_empty()` which is going to detect if a dependency is completely missing. But what detects that only some of the dependencies are present? Such as: [p] -> { Stmt3[i0, i1, i2, i3] -> Stmt3[o0, o1, o2, o3] : .... and p != 0 } or { Stmt3[i0, i1, i2, i3] -> Stmt3[o0, o1, o2, o3] : .... and i0 < 42 } (assuming contracting over `i=0`). `isReductionCarriedOverDim` doesn't seem to check whether the dependency is over the complete domain either.
1613	Yes, that would be great.
1619
1622	I think some info in the comment like "all surrounding band nodes are assumed to be part of the TC and must not be interleaved by filter nodes." Since it is not checking for it, it seems to imply that all other nodes types are OK? (sequence, set, expansion, extension, marker). Maybe reject them too? (I think ignoring marker nodes might still be ok)
polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm_2.ll
1–3	👍

gareevroman updated this revision to Diff 448777. Jul 29 2022, 11:43 PM

In D114336#3621094, @Meinersbur wrote:
In D114336#3576199, @gareevroman wrote:
void foo(int n, double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {
for (int i = 0; i < 1024; i++)
    for (int j = 0; j < 1024; j++)
      for (int l = 0; l < 64; l++)
        for (int w = 0; w < 64; ++w)
           if (w != 0)
             C[i][j] += A[i][l][w] * B[w][j][l];
}
ScopBuilder generates the following memory accesses:
{ Stmt3[i0, i1, i2, i3] -> MemRef2[o0, o1] : o0 = i0 and o1 = i1 }
{ Stmt3[i0, i1, i2, i3] -> MemRef2[o0, o1] : o0 = i0 and o1 = i1 }
{ Stmt3[i0, i1, i2, i3] -> MemRef1[o0, o1, o2] : o0 = 1 + i3 and o1 = i1 and o2 = i2 }
{ Stmt3[i0, i1, i2, i3] -> MemRef0[o0, o1, o2] : o0 = i0 and o1 = i2 and o2 = 1 + i3 }
In the context of the previous discussion, I meant that memory accesses are modified in comparison to the previous considered case.

Where does it come from?

If we look at the domain for the i3 variable, we see that the value 0 from the domain of w-loop is excluded and the loop bounds are modified to start from 0. Memory accesses correspond to this.
Looks like some other optimization (maybe JumpThreading?) modifies the loop range. Ideally, the detection would be robust enough to not depend on whether the domain space has an offset.

I agree. Improving the detection is a possible goal of a future work.

Could you choose more meaningful identifiers than I, J, P and J, or use them in the pattern described in isTCPattern? I think of something like:

I’ve added a description of bundles I, J, and P to the description of the pattern. I hope it clarifies their purpose. I propose to apply the terminology that is used in the paper [1] and its predecessors (e.g., [2]), to simplify the understanding of the code for their readers.

In a paper the names must be shorter to fit on the page and are described close to the figure. It also should not be necessary to have access to the paper to understand the algorithm. For narrowly scoped variables such as loop counters single letters might be ok because the definition is likely on the same screen (or for a paper: the same page), but globals should be more identifiable. Paper [1] also does not make the connection between the symbols it uses and what they correspond to in Polly's data structures.

However, I don't request such a change atm.

Ok. I've tried to make the description of the algorithm self-consistent.

for (...) {
  ...
  for (...) {
    if (c) {
      auto acc = C[P]; // ReadFromC
      auto a = A[Pa, I];
      auto b = B[Pb,J];
      auto arg = f(a,b);
      acc = acc op arg;
      C[P] = acc; // WriteToC
    }
  }
}

where P, I, J are sets of indices of the surrounding loops, Pa and Pb are subsets of P, I are the indices only occurring in the subscript of reading from A, J are the indices only occurring in the subscript for reading from B. There must be no indices not occurring in either P, I or J.
I think the definition of the TC-like corresponds to the information about sets P, I, J. Probably, it’s redundant to introduce the terminology for subsets at this point. Could we do this in the description of the optimization, if it’d be needed?

It is also 'redundant' with the code itself, but it helps understanding it.

The description of I and P are "Input dimensions of the schedule space, which represent free indices of tensors." One has to know what "free indices" are, indices of what, etc.

The "definition" of isTCPattern is declarative, does not even explain what all those symbols are, and refers to our matmul paper which does not apply because it only is a special case of the TC pattern.

I've tried to improve the description.

op= is a commutative operation ....,

I think this is a redundant condition. However, you can find it in the paper [1]. I believe that, if we preserve the order of loops with indexes from the bundle P during the optimization, there would not be any violation.

This is exactly the sort of thing I would want to clarify.

We don't check for associativity, because it's difficult and not necessary for the optimization. The optimization doesn't change the order of loops with indexes from the bundle P, even if you parallelize the outermost loop. Hence, it doesn't violate anything. For the same reason, we don't check for associativity in the case of the optimization of the generalization of matrix-matrix multiplication, which is currently used in Polly.
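The argument can be illustrated on a plain matrix multiplication. This is a minimal sketch with hypothetical function names and an arbitrary extent `N`; it is not the transformation implemented in Polly. It interchanges only the free loops i and j and leaves the contracted loop k in its original order, so every element of C is accumulated by the same sequence of floating-point operations.

```cpp
#include <cassert>
#include <vector>

constexpr int N = 8; // Illustrative square matrix extent.

// Accumulates one element of C; the contracted index k always increases.
void accumulate(std::vector<float> &C, const std::vector<float> &A,
                const std::vector<float> &B, int i, int j) {
  assert(i >= 0 && i < N && j >= 0 && j < N);
  for (int k = 0; k < N; k++)
    C[i * N + j] += A[i * N + k] * B[k * N + j];
}

// Loop order i, j, k.
std::vector<float> matmulIJ(const std::vector<float> &A,
                            const std::vector<float> &B) {
  std::vector<float> C(N * N, 0.0f);
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      accumulate(C, A, B, i, j);
  return C;
}

// Loop order j, i, k: the free loops are interchanged, the k loop is not.
std::vector<float> matmulJI(const std::vector<float> &A,
                            const std::vector<float> &B) {
  std::vector<float> C(N * N, 0.0f);
  for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
      accumulate(C, A, B, i, j);
  return C;
}

// Compares both loop orders on inputs that are not exactly representable.
bool interchangeKeepsResult() {
  std::vector<float> A(N * N), B(N * N);
  for (int x = 0; x < N * N; x++) {
    A[x] = 0.1f * (x % 11);
    B[x] = 1.0f / (x % 7 + 1);
  }
  return matmulIJ(A, B) == matmulJI(A, B);
}
```

Reordering the k loop itself would change the order of the additions, and for floating-point types that is where associativity would start to matter.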

c is an affine condition usually just true.

Could you elaborate on that? I’m not sure that I understand where such a condition is used.

It corresponds to a filter node in the TC body. You have used the if (w != 0) to illustrate where the access function deviates from the usual pattern, which implied that it would be something you would like to support. If not, that would not be part of the pattern.

Ok. Unfortunately, the current approach doesn't support this. So, it's not part of the pattern.

f is a side-effect free operation.

If I’m not mistaken, according to, for example, ScopDetection.cpp, only side effect free functions calls can be located inside a Scop.

f is not necessarily a function call, but as mentioned an "operation" representing the calculation done in the TC body.

Side-effect here means something different than ScopDetection. A write to an unrelated array D would be a side-effect for the TC, but accurately represented by a Scop.
We could allow unknown side-effects in polly in the future with a general "memory" dependency, an extension to what Polly already does with -polly-allow-modref-calls. These could just not be reordered relative to each other.

I agree that only side-effect free operations are considered in the pattern. Nevertheless, I propose not to use terms that may require an additional specification. I've tried to improve the description of the pattern.

I really think the documentation should be better. I had a hard time fixing bugs in the matmul optimization just with understanding what the code is supposed to be doing after a long time and would prefer to not repeat that again. See rGcad9f98a2ad98fecf663e9ce39502b8e43676fc9 and rGa56bd7dec8da4348d847d53c96d8a30f4a821d36.

Sure. Let's continue improving it.

Harbormaster completed remote builds in B178388: Diff 448777. Jul 29 2022, 11:54 PM

gareevroman marked 4 inline comments as done. Jul 29 2022, 11:56 PM

gareevroman added inline comments.

polly/lib/Transform/MatmulOptimizer.cpp
1219	I've added such a return statement to avoid scalar write memory accesses. Sorry, I was wrong. We need scalar read memory accesses. For example, in the case of the following matrix-matrix multiplication, a SCoP statement, which represents the body of the loop, contains the constant alpha: C = alpha * A * B. Could we accept non-partial scalar read memory accesses? I think this is legal.
1465–1470	Are dependencies determined by the form of memory accesses? In the isCorrectAccessMap function, we check that memory accesses aren't partial. Isn't it sufficient? I've tried to check whether the dependency is over the complete domain though.
1613	Ok. I've left that TODO comment.
1619	Looks like I missed that. Sorry. I will fix it in the next patch.
1622	I think some info in the comment like "all surrounding band nodes are assumed to be part of the TC and must not be interleaved by filter nodes." I've added it. Since it is not checking for it, it seems to imply that all other nodes types are OK? (sequence, set, expansion, extension, marker). Maybe reject them too? (I think ignoring marker nodes might still be ok) Sequence nodes could be necessary, if DeLICM was applied. Please, see the example inside the isTCPattern. Yes, I think other types except for marker nodes should be rejected. Additionally, as a precaution, I propose to check that a filter node has only sequence and domain nodes as its predecessors. I've updated the patch.
polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm_2.ll
1–3	Ok. I've added this to my TODO list.

Thank you Gareev. I think the description can still be improved, but we should also move forward and can improve iteratively.

Looking forward to the actual TC optimization.

polly/lib/Transform/MatmulOptimizer.cpp
193	AFAIU multiplication by β is not part of this detection, but required to be loop-distributed by the isl scheduler.
1141
1284	It does not require DeLICM, but `-polly-allow-nonaffine-branches` (which is enabled by default)
1614	[typo]
1695	What is Goto here? GotoBLAS?

This revision is now accepted and ready to land. Aug 2 2022, 11:55 AM

Closed by commit rGb02c7e2b630a: [Polly] Generalize the pattern matching to the case of tensor contractions (authored by gareevroman). Aug 7 2022, 4:22 AM

This revision was automatically updated to reflect the committed changes.

gareevroman marked 2 inline comments as done.

gareevroman added a commit: rGb02c7e2b630a: [Polly] Generalize the pattern matching to the case of tensor contractions.

In D114336#3694323, @Meinersbur wrote:

Thank you Gareev. I think the description can still be improved, but we should also move forward and can improve iteratively.

Looking forward to the actual TC optimization.

Thanks! I've tried to address new comments in the committed patch.

polly/lib/Transform/MatmulOptimizer.cpp

193

Yes, it's not. I've added a comment about this.

1284

If I'm not mistaken, in your example the form of the dependencies doesn't correspond to the pattern.

c = C[i][j];
if (/*non-affine condition*/) {
  A[i][k] + B[k][j];
} else {
  C[i][j] = c;
}

MayWrite: { Stmt_for_body8TOfor_inc[i0, i1, i2] -> MemRef_C[i0, i1] : 0 <= i0 <= 31 and 0 <= i1 <= 31 and 0 <= i2 <= 31 }

I've added a slightly modified version of it to polly/test/ScheduleOptimizer/pattern-matching-based-opts_23.ll. It produces a region statement too.

for (int i = 0; i < 32; i++)
  for (int j = 0; j < 32; j++)
    for (int k = 0; k < 32; k++) {
      int c = C[i][j];
      if (i*j*k < 10) {
        C[i][j] = A[i][k] + B[k][j];
      } else {
        C[i][j] = c;
      } 
}

However, it introduces store merge phi nodes. It makes DeLICM necessary.

Statements {
	Stmt_for_body8__TO__if_end
        Domain :=
            { Stmt_for_body8__TO__if_end[i0, i1, i2] : 0 <= i0 <= 31 and 0 <= i1 <= 31 and 0 <= i2 <= 31 };
        Schedule :=
            { Stmt_for_body8__TO__if_end[i0, i1, i2] -> [i0, i1, i2, 0] };
        ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
            { Stmt_for_body8__TO__if_end[i0, i1, i2] -> MemRef_A[i0, i2] };
        ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
            { Stmt_for_body8__TO__if_end[i0, i1, i2] -> MemRef_B[i2, i1] };
        ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
            { Stmt_for_body8__TO__if_end[i0, i1, i2] -> MemRef_C[i0, i1] };
        MustWriteAccess :=	[Reduction Type: NONE] [Scalar: 1]
            { Stmt_for_body8__TO__if_end[i0, i1, i2] -> MemRef_storemerge__phi[] };
       new: { Stmt_for_body8__TO__if_end[i0, i1, i2] -> MemRef_C[i0, i1] };
	Stmt_if_end
        Domain :=
            { Stmt_if_end[i0, i1, i2] : 0 <= i0 <= 31 and 0 <= i1 <= 31 and 0 <= i2 <= 31 };
        Schedule :=
            { Stmt_if_end[i0, i1, i2] -> [i0, i1, i2, 1] };
        ReadAccess :=	[Reduction Type: NONE] [Scalar: 1]
            { Stmt_if_end[i0, i1, i2] -> MemRef_storemerge__phi[] };
       new: { Stmt_if_end[i0, i1, i2] -> MemRef_C[i0, i1] };
        MustWriteAccess :=	[Reduction Type: NONE] [Scalar: 0]
            { Stmt_if_end[i0, i1, i2] -> MemRef_C[i0, i1] };

Meinersbur added inline comments. Aug 8 2022, 3:53 PM

polly/lib/Transform/MatmulOptimizer.cpp
1284	The `storemerge` PHI node is introduced by one of the canonicalization passes (in this case: InstCombine). It is possible to not run that pass, disable the matching of this particular pattern, or use some other trick to not be matched. In any case, we cannot rely on the InstCombine to happen. It might be safer to bail out if any RegionStmt is encountered.

gareevroman marked an inline comment as done. Aug 14 2022, 3:39 AM

gareevroman added inline comments.

polly/lib/Transform/MatmulOptimizer.cpp
1284	I see. I haven't managed to fix that test case. So, I've decided to remove it. I've left the check that bails out if any RegionStmt is encountered in containsOnlyTCAcc.

gareevroman mentioned this in rGa5d981045de7: [Polly] Remove the test case that depends on InstCombine and DeLICM. Aug 14 2022, 6:07 AM

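Revision Contents

Path	Size
polly/lib/Transform/MatmulOptimizer.cpp	618 lines
polly/lib/Transform/ScheduleOptimizer.cpp	36 lines
polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm.ll	32 lines
polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm_2.ll	108 lines
polly/test/ScheduleOptimizer/pattern-matching-based-opts.ll	3 lines
polly/test/ScheduleOptimizer/pattern-matching-based-opts_11.ll	4 lines
polly/test/ScheduleOptimizer/pattern-matching-based-opts_15.ll	4 lines
polly/test/ScheduleOptimizer/pattern-matching-based-opts_16.ll	64 lines
polly/test/ScheduleOptimizer/pattern-matching-based-opts_17.ll	64 lines
polly/test/ScheduleOptimizer/pattern-matching-based-opts_18.ll	84 lines
polly/test/ScheduleOptimizer/pattern-matching-based-opts_19.ll	84 lines
polly/test/ScheduleOptimizer/pattern-matching-based-opts_2.ll	4 lines
polly/test/ScheduleOptimizer/pattern-matching-based-opts_20.ll	94 lines
polly/test/ScheduleOptimizer/pattern-matching-based-opts_4.ll	6 lines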

Diff 393734

polly/lib/Transform/MatmulOptimizer.cpp

  //===- MatmulOptimizer.cpp -----------------------------------------------===//
  //
  // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
  // See https://llvm.org/LICENSE.txt for license information.
  // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
  //
  //===----------------------------------------------------------------------===//
  #include "polly/MatmulOptimizer.h"
  #include "polly/DependenceInfo.h"
  #include "polly/Options.h"
  #include "polly/ScheduleTreeTransform.h"
  #include "polly/ScopInfo.h"
  #include "polly/ScopPass.h"
  #include "polly/Simplify.h"
+ #include "polly/Support/GICHelper.h"
  #include "polly/Support/ISLTools.h"
  #include "llvm/ADT/ArrayRef.h"
+ #include "llvm/ADT/DenseSet.h"
  #include "llvm/ADT/Optional.h"
  #include "llvm/ADT/Sequence.h"
+ #include "llvm/ADT/SetOperations.h"
  #include "llvm/ADT/SmallVector.h"
  #include "llvm/ADT/StringRef.h"
  #include "llvm/ADT/iterator_range.h"
  #include "llvm/Analysis/TargetTransformInfo.h"
  #include "llvm/IR/DataLayout.h"
  #include "llvm/IR/Function.h"
  #include "llvm/IR/Module.h"
  #include "llvm/Support/CommandLine.h"

  [93 lines not shown] static cl::opt<int> VectorRegisterBitwidth(
      cl::Hidden, cl::init(-1), cl::ZeroOrMore, cl::cat(PollyCategory));
  static cl::opt<int> PollyPatternMatchingNcQuotient(
      "polly-pattern-matching-nc-quotient",
      cl::desc("Quotient that is obtained by dividing Nc, the parameter of the"
               "macro-kernel, by Nr, the parameter of the micro-kernel"),
      cl::Hidden, cl::init(256), cl::ZeroOrMore, cl::cat(PollyCategory));
+ static cl::opt<bool>
+     PMBasedTCOpts("polly-tc-opt",
+                   cl::desc("Perform optimizations of tensor contractions based "
+                            "on pattern matching"),
+                   cl::init(false), cl::ZeroOrMore, cl::cat(PollyCategory));
+ static cl::opt<bool>
+     PMBasedMMMOpts("polly-matmul-opt",
+                    cl::desc("Perform optimizations of matrix multiplications "
+                             "based on pattern matching"),
+                    cl::init(true), cl::ZeroOrMore, cl::cat(PollyCategory));

  namespace {
  /// Parameters of the micro kernel.
  ///
  /// Parameters, which determine sizes of rank-1 (i.e., outer product) update
  /// used in the optimized matrix multiplication.
  struct MicroKernelParamsTy {
    int Mr;
    int Nr;
  [18 lines not shown] struct MatMulInfoTy {
    MemoryAccess *B = nullptr;
    MemoryAccess *ReadFromC = nullptr;
    MemoryAccess *WriteToC = nullptr;
    int i = -1;
    int j = -1;
    int k = -1;
  };

/// Parameters of the tensor contraction operands.

///

/// Parameters, which describe access relations that represent operands of the

/// tensor contraction.

MeinersburUnsubmitted

Done

Please add more details on what the members represent.


struct TCInfoTy {

// Memory accesses that represent reading from tensors, which are operands of

// the tensor contraction.

MemoryAccess *A = nullptr;

MemoryAccess *B = nullptr;

MeinersburUnsubmitted

Done

struct TCInfoTy {

- // Memory accesses that represent reading from tensors, which are operands of

- // the tensor contraction.

+ /// @{

+ /// Memory accesses that represent reading from tensors, which are operands of

+ /// the tensor contraction.

MemoryAccess *A = nullptr;

MemoryAccess *B = nullptr;

+ /// @}

// Memory accesses that represent reading from and writing into the tensor,

Please use doxygen comments to describe class/struct members. @{ @} can be used to group them.

Also add an empty line before a new comment begins.


// Memory accesses that represent reading from and writing into the tensor,

// which contains the result of the tensor contraction.

MemoryAccess *ReadFromC = nullptr;

MeinersburUnsubmitted

Done

std::set is a high-overhead implementation. Consider using DenseSet or SmallDenseSet. See https://www.llvm.org/docs/ProgrammersManual.html#llvm-adt-denseset-h


MemoryAccess *WriteToC = nullptr;

// Input dimensions of the schedule space, which represent free

// indices of tensors.

SmallDenseSet<int> I;

MeinersburUnsubmitted

Done

AFAIU multiplication by β is not part of this detection, but required to be loop-distributed by the isl scheduler.


gareevromanAuthorUnsubmitted

Done

Yes, it's not. I've added a comment about this.


SmallDenseSet<int> J;

// Input dimensions of the schedule space, which represent contracted

// indices of tensors.

MeinersburUnsubmitted

Done

Is there a reason to use 30 as the small size? If not, consider using just SmallVector<int>.

SmallDenseSet<int> P;

// Sizes of tensor dimensions for corresponding input dimensions of

// the schedule space. The size of the tensor dimension can be larger than

// the size of the corresponding input dimension of the schedule space.

// This does not correspond to a tensor contraction. However, such a pattern

// will be optimized by the transformation.

SmallVector<int> DimensionSizes;

SmallVector<int> ADimensions;

MeinersburUnsubmitted

Done

@{ is not needed when documenting just a single member.


SmallVector<int> BDimensions;

SmallVector<int> CDimensions;

// Permutations of indices of I, J, and P, which describe operands of

// the tensor contraction and its result.

SmallVector<int> OrderedI;

SmallVector<int> OrderedJ;

SmallVector<int> OrderedP;

};

/// Create an isl::union_set, which describes the option of the form

/// [isolate[] -> unroll[x]].

///

/// @param Ctx An isl::ctx, which is used to create the isl::union_set.

static isl::union_set getUnrollIsolatedSetOptions(isl::ctx Ctx) {

isl::space Space = isl::space(Ctx, 0, 0, 1);

isl::map UnrollIsolatedSetOption = isl::map::universe(Space);

isl::id DimInId = isl::id::alloc(Ctx, "isolate", nullptr);

[... 848 lines not shown; enclosing context: if (LeafType != isl_schedule_node_leaf || ...]

isl_union_map_n_map(PartialSchedule.get()) != 1)

return false;

auto NewPartialSchedule = isl::map::from_union_map(PartialSchedule);

if (containsMatrMult(NewPartialSchedule, D, MMI))

return true;

return false;

}

/// Get the dimension size.

///

/// Return the size of the dimension @p Pos, which is obtained from @p SAI.

/// Return -1 in the case of the first dimension of a multi-dimensional array,

/// since the ScopArrayInfo class does not carry size information.

///

/// @param SAI The information about the array.

/// @param Pos The position of the dimension.

/// @return The size of the dimension.

static int getDimSize(const ScopArrayInfo *SAI, unsigned Pos) {

if (Pos == 0)

return -1;

const llvm::SCEV *SCEVDimSize = SAI->getDimensionSize(Pos);

assert(SCEVDimSize);

MeinersburUnsubmitted

Done

const llvm::SCEV *SCEVDimSize = SAI->getDimensionSize(Pos);

- assert(SCEVDimSize && L"Prevent the undefined behavior");

+ assert(SCEVDimSize && "Prevent the undefined behavior");

auto *ConstantDimSize = dyn_cast<const SCEVConstant>(SCEVDimSize);

[style] No reason to make this a wide string literal, especially if it is just used as an assertion failure message.

Applies to other occurrences as well.

auto *ConstantDimSize = dyn_cast<const SCEVConstant>(SCEVDimSize);

assert(ConstantDimSize);

auto *IntDimSize = dyn_cast<ConstantInt>(ConstantDimSize->getValue());

assert(IntDimSize);

return IntDimSize->getSExtValue();

}

/// Check whether the access relation has the specified form.

///

/// Check that the access relation @p AccMap has the form T[I0, …, In], where

/// indexes I0, …, In are specified by @p Dimensions.

///

/// @param Domain The domain of the access relation.

/// @param AccMap The access relation to be checked.

/// @param Dimensions The permutation of the subset of the input dimensions.

/// @return True if @p AccMap has the expected form and false,

/// otherwise.

static bool isCorrectAccessMap(isl::set Domain, isl::map AccMap,

ArrayRef<int> Dimensions) {

MeinersburUnsubmitted

Done

static bool isCorrectAccessMap(isl::set Domain, isl::map AccMap,

- const SmallVector<int, 30> &Dimensions) {

+ ArrayRef<int> Dimensions) {

isl::space Space = AccMap.get_space();


isl::space Space = AccMap.get_space();

if (unsignedFromIslSize(Space.dim(isl::dim::out)) != Dimensions.size())

return false;

isl::map Universe = isl::map::universe(Space);

// Create an access relation of the following form:

MeinersburUnsubmitted

Done

isl::map PossibleTensor = isl::manage(Universe.copy());

- for (int i = 0; i < static_cast<int>(Dimensions.size()); i++) {

+ for (unsigned i = 0; i < unsignedFromIslSize(Dimensions.size()); i++) {

const int InPos = Dimensions[i];

or introduce intFromIslSize.


gareevromanAuthorUnsubmitted

Done

As far as I understand, Dimensions.size() returns a value of type size_t rather than a value of type isl_size. So, in the new version I used the unsigned type to avoid the cast.

MeinersburUnsubmitted

Not Done

rangeIslSize should make it easier.


gareevromanAuthorUnsubmitted

Done

I think rangeIslSize can’t be used in this case. However, I’ve tried to use rangeIslSize to improve the patch.


// [I0, …, Im] -> [Il, …, In], where indexes

// Il, …, In are specified by @p Dimensions.

isl::map PossibleTensor = isl::manage(Universe.copy());

for (unsigned i = 0; i < Dimensions.size(); i++) {

const int InPos = Dimensions[i];

if (InPos == -1)

return false;

PossibleTensor =

PossibleTensor.equate(isl::dim::in, InPos, isl::dim::out, i);

}

AccMap = AccMap.intersect_domain(Domain);

PossibleTensor = PossibleTensor.intersect_domain(Domain);

// If AccMap spans entire domain (Non-partial write),

// compute FirstPos and SecondPos.

// If AccMap != PossibleTensor here (the two maps have been gisted at

// this point), it means that the writes are not complete, or in other

// words, it is a Partial write and Partial writes must be rejected.

return AccMap.is_equal(PossibleTensor);

MeinersburUnsubmitted

Not Done

I like the idea of verifying the correctness by reconstructing and comparing to the original.

Maybe do it at the end to verify that the entire TCInfoTy is correct? On the other hand, failing earlier would be better. What do you think?

gareevromanAuthorUnsubmitted

Done

Other parts of TCInfoTy are verified in isTCOperandAcc too. I think that it would be better to verify the related information in one place as much and as early as possible.

Probably, the earlier fail would simplify the debugging, since we exactly know the form of memory accesses and can rely on it. Additionally, the performance can be improved, since the earlier fail helps to avoid additional operations with sets.


}

/// Check whether the access represents the tensor contraction operand.

///

/// Check that the access relation @p AccMap has the form T[i1, …, in].

MeinersburUnsubmitted

Done

/// Obtained indexes i1, …, in, their sizes and their permutation are stored

- /// into @p IndexSet, @p DimensionSizes, and @p Dimensions, respectively.

+ /// into @p IndexSet, @p DimensionSizes, and @p Dimensions, respectively.

///

/// @param Domain The domain of the access relation.


/// Obtained indexes i1, …, in, their sizes and their permutation are stored

/// into @p IndexSet, @p DimensionSizes, and @p Dimensions, respectively.

///

/// @param Domain The domain of the access relation.

/// @param AccMap The access relation to be checked.

/// @param IndexSet The subset of the input dimensions.

/// @param DimensionSizes Sizes of the input dimensions of @p Dimensions.

/// @param Dimensions The permutation of the subset of the input dimensions.

/// @return True if @p AccMap has the expected form and false,

MeinersburUnsubmitted

Done

std::set<int> &IndexSet,

- SmallVector<int, 30> &DimensionSizes,

- SmallVector<int, 30> &Dimensions) {

+ SmallVectorImpl<int> &DimensionSizes,

+ SmallVectorImpl<int> &Dimensions) {

isl::id Id = AccMap.get_tuple_id(isl::dim::out);

SmallVectorImpl is not specific to what the vector's small size is.


/// otherwise.

static bool isTCOperandAcc(isl::set Domain, isl::map AccMap,

SmallDenseSet<int> &IndexSet,

SmallVectorImpl<int> &DimensionSizes,

SmallVectorImpl<int> &Dimensions) {

isl::id Id = AccMap.get_tuple_id(isl::dim::out);

const ScopArrayInfo *SAI = ScopArrayInfo::getFromId(Id);

assert(SAI && "AccMap should represent memory access");

// Fix values of output dimensions with respect to their positions.

MeinersburUnsubmitted

Done

unsigned InDimNum = unsignedFromIslSize(CheckMap.dim(isl::dim::in));

- for (unsigned i = 0; i < InDimNum; i++) {

+ for (unsigned i : rangeIslSize(0, CheckMap.dim(isl::dim::in)))

isl::val Val = isl::manage(

Consider using polly::rangeIslSize for iterating over dimensions.


isl::map CheckMap = isl::manage(AccMap.copy());

unsigned OutDimNum = unsignedFromIslSize(CheckMap.dim(isl::dim::out));

for (unsigned i = 0; i < OutDimNum; i++)

CheckMap = CheckMap.fix_si(isl::dim::out, i, i);

MeinersburUnsubmitted

Done

This is a weird way to find out which indices map to which other index; I guess it is the equivalent of isMatMulOperandAcc. It requires that the dimension number is part of the AccMap's range, and it will fail if there is any expression (e.g. Stmt[i][j] -> A[i-1][j+1] matches the wrong dimensions), but at least there is the verification afterwards.

I am not sure I like this sort of cleverness; I'd rather expect some sort of introspection into the map's coefficients, but I also think this should work in nearly all relevant cases and should be safe due to the verification, so let's keep it.

However, please document this better, e.g. add an example of what is expected to happen.
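The behaviour discussed in this thread can be illustrated with a small stand-alone model. It is only a sketch and does not use isl: `Subscript`, `guessDimensions`, and `verify` are hypothetical names, each subscript is assumed to have the form `in[InDim] + Offset`, and every input dimension is assumed to occur at most once. `guessDimensions` mimics fixing output dimension `o` to the value `o` and reading back the value each input dimension is forced to; `verify` mimics the reconstruction check done by isCorrectAccessMap.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of one array subscript: out[o] = in[InDim] + Offset.
struct Subscript {
  int InDim;
  int Offset;
};

// Model of the "fix and read back" step: fixing out[o] = o forces
// in[InDim] = o - Offset. The forced value is taken as the output position
// of that input dimension. Out-of-range values are ignored in this model.
std::vector<int> guessDimensions(const std::vector<Subscript> &Acc) {
  const int OutDimNum = static_cast<int>(Acc.size());
  std::vector<int> Dimensions(Acc.size(), -1);
  for (int O = 0; O < OutDimNum; ++O) {
    int Forced = O - Acc[O].Offset;
    if (Forced >= 0 && Forced < OutDimNum)
      Dimensions[Forced] = Acc[O].InDim;
  }
  return Dimensions;
}

// Model of the verification step: rebuild out[o] = in[Dimensions[o]] and
// compare it with the original access.
bool verify(const std::vector<Subscript> &Acc,
            const std::vector<int> &Dimensions) {
  if (Acc.size() != Dimensions.size())
    return false;
  for (std::size_t O = 0; O < Acc.size(); ++O)
    if (Dimensions[O] == -1 || Acc[O].InDim != Dimensions[O] ||
        Acc[O].Offset != 0)
      return false;
  return true;
}
```

In this model, Stmt[i, j, k] -> A[i, k] gives Dimensions = {0, 2} and passes verification, while Stmt[i, j] -> A[i - 1, j + 1] gives the swapped guess {1, 0}, which the verification step rejects.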

// In the case of the tensor contraction, values of output dimensions are

// fixed and form a permutation of a subset of values of input dimensions.

// Try to obtain the permutation and sizes of corresponding input dimensions.

Dimensions.assign(OutDimNum, -1);

for (unsigned i : rangeIslSize(0, CheckMap.dim(isl::dim::in))) {

isl::val Val = isl::manage(

isl_map_plain_get_val_if_fixed(CheckMap.get(), isl_dim_in, i));

MeinersburUnsubmitted

Done

The "plain" functions are unfortunately not very robust, e.g. their result differs depending on the internal representation. I'd suggest getConstant (from ISLTools), but it only takes a pw_aff.

Could you extract uses of plain_get_val_if_fixed into such a function, and mark it as TODO to cope with it later?

if (!Val.is_int())

continue;

int OutPos = -1;

llvm::APInt ValAPInt = APIntFromVal(Val);

if (ValAPInt.isSignedIntN(32))

OutPos = ValAPInt.getSExtValue();

assert((OutPos != -1) && (OutPos < static_cast<int>(OutDimNum)));

if (IndexSet.count(i))

return false;

IndexSet.insert(i);

Dimensions[OutPos] = i;

if (DimensionSizes[i] <= 0)

DimensionSizes[i] = getDimSize(SAI, OutPos);

}

MeinersburUnsubmitted

Done

/// @return The set intersection.

- std::set<int> intersect(const std::set<int> &A, const std::set<int> &B) {

+ static std::set<int> intersect(const std::set<int> &A, const std::set<int> &B) {

std::set<int> Intersection;

Although already in an anon namespace, the other methods add static as well. I found it helps the compiler to warn if a static function is unused.

return isCorrectAccessMap(Domain, AccMap, Dimensions);

}

MeinersburUnsubmitted

Done

Do you know of #include <llvm/ADT/SetOperations.h>? Unfortunately, these modify one set rather than returning a new set.


/// Find the intersection of two sets.

///

/// Find the intersection of the set @p A and the set @p B.

///

/// @param A, B Sets to intersect.

/// @return The set intersection.

static SmallDenseSet<int> intersect(const SmallDenseSet<int> &A,

const SmallDenseSet<int> &B) {

SmallDenseSet<int> Intersection = A;

set_intersect(Intersection, B);

return Intersection;

}

/// Determine the access that writes to the tensor, which contains

/// the result of the tensor contraction.

///

/// @param Domain The domain of the statement.

/// @param Stmt The statement, which writes to memory.

/// @param TCI The information about the tensor contraction.

/// @param IandJIndexSet The set, which contains free indexes of tensors.

/// @return The determined MemoryAccess, or nullptr if there is no necessary

/// access within the SCoP.

static MemoryAccess *getWriteAccess(isl::set Domain, ScopStmt *Stmt,

TCInfoTy &TCI,

SmallDenseSet<int> &IandJIndexSet) {

TCI.WriteToC = nullptr;

SmallVector<MemoryAccess *, 32> Accesses = getAccessesInOrder(*Stmt);

for (MemoryAccess *MemA : reverse(Accesses)) {

if (!MemA->isLatestArrayKind())

MeinersburUnsubmitted

Done

auto *MemA = Accesses.end() - 1;

- for (; MemA != Accesses.begin(); MemA--) {

+ for (MemoryAccess *MemA : reverse(Accesses)) {

MemoryAccess *MemAccessPtr = *MemA;


MeinersburUnsubmitted

Not Done

I can use JScopImport to set a scalar memory access to a partial write without adding additional dependencies; that is, I don't think this can be just ignored.

I suggest to have a single function that calls getAccessesInOrder and sort out which MemAccess is read/write in there, then analyze them.


gareevromanAuthorUnsubmitted

Done

I am not sure whether modifications of implementations of tensor contractions, which contain read and write scalar memory accesses, are useful in practice.

Moreover, since bundles of induction variables I, J, P can contain an unlimited number of dimensions, we possibly cannot follow the algorithm from the containsOnlyMatrMultAcc function, which permutes dimensions and checks that additional memory accesses have stride 0 in terms of dimensions MMM.i, MMM.j, and MMM.k. Consequently, such memory accesses can be treated as scalar memory accesses. I have not come up with an effective alternative yet.

That is why I do not consider scalar memory accesses in the getWriteAccess and setReadAccesses functions. Could we mark it as TODO and do it in the future?

MeinersburUnsubmitted

Not Done

The concern is that I can modify what isLatestArrayKind() returns by simply importing a JScop. The continue just ignores such weirdness but I think it is safer to fail in this case.

You yourself mention that scalar accesses are likely not useful, so why not fail when one is found instead (return null instead of continue)? Some exceptions may be possible, such as read-only scalars (VirtualUse::ReadOnly, VirtualUse::Synthesizable)


gareevromanAuthorUnsubmitted

Not Done

I've added such a return statement to avoid scalar write memory accesses. Sorry, I was wrong. We need scalar read memory accesses. For example, in the case of the following matrix-matrix multiplication, a SCoP statement, which represents the body of the loop, contains the constant alpha.

C = alpha*A*B

Could we accept non-partial scalar read memory accesses? I think this is legal.
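The policy discussed in this thread (reject scalar writes, tolerate scalar reads) could look roughly like the following stand-alone sketch. It does not use Polly's classes: `Access` and `findFinalArrayWrite` are hypothetical names, and the two flags merely stand in for isLatestArrayKind() and isWrite().

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for a memory access of a statement.
struct Access {
  bool IsArray; // models isLatestArrayKind()
  bool IsWrite; // models isWrite()
};

// Walk the accesses from last to first. Scalar reads are skipped, a scalar
// write rejects the statement, and the last array access must be a write.
// Returns the index of that write, or -1 if the statement is rejected.
int findFinalArrayWrite(const std::vector<Access> &Accesses) {
  for (int Idx = static_cast<int>(Accesses.size()) - 1; Idx >= 0; --Idx) {
    const Access &A = Accesses[Idx];
    if (!A.IsArray) {
      if (A.IsWrite)
        return -1; // scalar write: fail instead of ignoring it
      continue;    // scalar read (e.g. the constant alpha): tolerated
    }
    return A.IsWrite ? Idx : -1;
  }
  return -1;
}
```

With this shape a statement such as C = alpha*A*B is still accepted, because the scalar read of alpha is skipped, while a statement containing a scalar write is rejected outright.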


continue;

// The last memory access should be a write memory access.

if (!MemA->isWrite())

return nullptr;

isl::map AccMap = MemA->getLatestAccessRelation();

if (!isTCOperandAcc(Domain, AccMap, IandJIndexSet, TCI.DimensionSizes,

TCI.CDimensions))

return nullptr;

return MemA;

MeinersburUnsubmitted

Done

This computes whether two sets are disjoint; it should not be necessary to compute the intersection.

gareevromanAuthorUnsubmitted

Done

That check is redundant. Thanks.
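For reference, a disjointness test that never materializes the intersection can be sketched as follows. This is an illustration only: `areDisjoint` is a hypothetical helper, and `std::set` stands in for the set type used in the patch.

```cpp
#include <cassert>
#include <set>

// Returns true if A and B have no common element. The smaller set is
// scanned and the loop stops at the first common element, so no
// intersection set is ever built.
bool areDisjoint(const std::set<int> &A, const std::set<int> &B) {
  const std::set<int> &Small = A.size() <= B.size() ? A : B;
  const std::set<int> &Large = A.size() <= B.size() ? B : A;
  for (int X : Small)
    if (Large.count(X))
      return false;
  return true;
}
```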


}

return nullptr;

}

/// Determine an access, which reads elements of an operand of the tensor

/// contraction

///

/// @param MemAccessPtr The access, which reads elements of the tensor.

/// @param IndexSet The set, which contains indexes of the tensors.

/// @param IandJIndexSet The set, which contains free indexes of tensors.

/// @param Dimensions The permutation of the subset of the input dimensions.

/// @param TCI The information about the tensor contraction.

/// @return True if the memory access @p MemAccessPtr corresponds

/// to the tensor contraction.

static bool setReadAccess(MemoryAccess *MemAccessPtr,

const SmallDenseSet<int> &IndexSet,

MeinersburUnsubmitted

Done

const std::set<int> &IandJIndexSet,

- SmallVector<int, 30> Dimensions, TCInfoTy &TCI) {

+ ArrayRef<int> Dimensions, TCInfoTy &TCI) {

if (!TCI.A) {


gareevromanAuthorUnsubmitted

Done

As far as I understand, we cannot do this here because of the assignment to TCI.ADimensions and TCI.BDimensions


MeinersburUnsubmitted

Not Done

The assignments should just make a copy of the array. With Dimensions being passed by-value, the caller has to make the copy, which it should not need to.

SmallVector has no overload for being assigned an ArrayRef, but you could use llvm::append_range to insert all the values.

gareevromanAuthorUnsubmitted

Done

I think we can use llvm::replace to avoid clearing the vector and preserve the logic.


const SmallDenseSet<int> &IandJIndexSet,

SmallVector<int> Dimensions, TCInfoTy &TCI) {

if (!TCI.A) {

// Probably IndexSet is a union of I and P sets.

if (intersect(IndexSet, TCI.P).size() != TCI.P.size())

return false;

MeinersburUnsubmitted

Done

Introduce a is_superset (etc) call?


// Obtain the set I.

TCI.I = std::move(set_difference(IndexSet, TCI.P));

if (intersect(IandJIndexSet, TCI.I).size() != TCI.I.size())

return false;

// Obtain the set J.

TCI.J = std::move(set_difference(IandJIndexSet, TCI.I));

// Set the first operand of the tensor contraction.

TCI.A = MemAccessPtr;

TCI.ADimensions = std::move(Dimensions);

return true;

}

if (!TCI.B) {

// IndexSet should be a union of J and P sets.

if (!(intersect(IndexSet, TCI.J).size() == TCI.J.size()) &&

(intersect(IndexSet, TCI.P).size() == TCI.P.size()) &&

(IndexSet.size() == TCI.P.size() + TCI.J.size()))

MeinersburUnsubmitted

Done

Could we add utility functions such that this becomes unite(J, P) == IndexSet?
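A possible shape for the suggested helper is sketched below. It is an illustration only: `unite` is a hypothetical name taken from the comment above, `isUnionBySizes` is a hypothetical spelling of the intended size-based check, and `std::set` stands in for the set type used in the patch. Note that in the condition as written in this revision, the `!` appears to apply only to the first comparison rather than to the whole conjunction; comparing against the union avoids that kind of slip.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical helper: returns the union of A and B as a new set.
std::set<int> unite(std::set<int> A, const std::set<int> &B) {
  A.insert(B.begin(), B.end());
  return A;
}

// The size-based formulation: J and P are both contained in IndexSet and
// together account for every element of IndexSet.
bool isUnionBySizes(const std::set<int> &IndexSet, const std::set<int> &J,
                    const std::set<int> &P) {
  auto CountIn = [&IndexSet](const std::set<int> &S) {
    std::size_t N = 0;
    for (int X : S)
      N += IndexSet.count(X);
    return N;
  };
  return CountIn(J) == J.size() && CountIn(P) == P.size() &&
         IndexSet.size() == P.size() + J.size();
}
```

The two formulations agree whenever J and P are disjoint, which should hold here because containsOnlyTCAcc rejects statements whose free indices intersect P.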


return false;

// Set the second operand of the tensor contraction.

TCI.B = MemAccessPtr;

TCI.BDimensions = std::move(Dimensions);

return true;

}

return false;

}
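As a worked example of how setReadAccess splits the indices, consider loop dimensions (i0, i1, i2, i3) and the statement C[i0][i1][i2] += A[i0][i3] * B[i3][i1][i2]. The stand-alone sketch below mirrors the intended set logic with `std::set`; `FreeIndices`, `isSubset`, `difference`, and `splitFreeIndices` are hypothetical names, and P is assumed to be known already.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>

struct FreeIndices {
  std::set<int> I; // free indices contributed by the first operand
  std::set<int> J; // free indices contributed by the second operand
};

static bool isSubset(const std::set<int> &Sub, const std::set<int> &Super) {
  return std::includes(Super.begin(), Super.end(), Sub.begin(), Sub.end());
}

static std::set<int> difference(const std::set<int> &A,
                                const std::set<int> &B) {
  std::set<int> R;
  std::set_difference(A.begin(), A.end(), B.begin(), B.end(),
                      std::inserter(R, R.end()));
  return R;
}

// AIdx and BIdx are the index sets of the two read operands, IandJ is the
// index set of the written tensor, and P holds the contracted indices.
bool splitFreeIndices(const std::set<int> &AIdx, const std::set<int> &BIdx,
                      const std::set<int> &IandJ, const std::set<int> &P,
                      FreeIndices &Out) {
  if (!isSubset(P, AIdx)) // the first operand must contain all of P
    return false;
  Out.I = difference(AIdx, P);
  if (!isSubset(Out.I, IandJ)) // I must consist of free indices only
    return false;
  Out.J = difference(IandJ, Out.I);
  std::set<int> JandP = Out.J;
  JandP.insert(P.begin(), P.end());
  return BIdx == JandP; // the second operand must be exactly J united with P
}
```

For the statement above, AIdx = {0, 3}, BIdx = {1, 2, 3}, IandJ = {0, 1, 2}, and P = {3}, which yields I = {0} and J = {1, 2}.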

/// Check that all memory accesses of the statement, except from the last

MeinersburUnsubmitted

Done

getAccessesInOrder requires Stmt to not be a RegionStmt. Please add a test for it.


gareevromanAuthorUnsubmitted

Done

I’ve added a check to containsOnlyTCAcc. Could you clarify what the test case should look like? Should it be a region statement that contains a matrix multiplication with the right order of memory accesses?

MeinersburUnsubmitted

Not Done

Test in containsOnlyTCAcc is exactly what I was looking for. A region statement could look like this:

c = C[i][j];
if (/*non-affine condition*/) {
  (void)A[i][k] + B[k][j];
} else {
  C[i][j] = c;
}

which has the correct order of accesses but is obviously not what we are looking for.


gareevromanAuthorUnsubmitted

Done

Thanks for the example! I have added a corresponding test case. If I am not mistaken, it requires DeLICM.


MeinersburUnsubmitted

Not Done

It does not require DeLICM, but -polly-allow-nonaffine-branches (which is enabled by default)


gareevromanAuthorUnsubmitted

Not Done

If I'm not mistaken, in your example the form of the dependencies doesn't correspond to the pattern.

c = C[i][j];
if (/*non-affine condition*/) {
  A[i][k] + B[k][j];
} else {
  C[i][j] = c;
}

MayWrite: { Stmt_for_body8TOfor_inc[i0, i1, i2] -> MemRef_C[i0, i1] : 0 <= i0 <= 31 and 0 <= i1 <= 31 and 0 <= i2 <= 31 }

I've added a slightly modified version of it to polly/test/ScheduleOptimizer/pattern-matching-based-opts_23.ll. It produces a region statement too.

for (int i = 0; i < 32; i++)
  for (int j = 0; j < 32; j++)
    for (int k = 0; k < 32; k++) {
      int c = C[i][j];
      if (i*j*k < 10) {
        C[i][j] = A[i][k] + B[k][j];
      } else {
        C[i][j] = c;
      }
    }

However, it introduces store merge phi nodes. It makes DeLICM necessary.

Statements {
	Stmt_for_body8__TO__if_end
        Domain :=
            { Stmt_for_body8__TO__if_end[i0, i1, i2] : 0 <= i0 <= 31 and 0 <= i1 <= 31 and 0 <= i2 <= 31 };
        Schedule :=
            { Stmt_for_body8__TO__if_end[i0, i1, i2] -> [i0, i1, i2, 0] };
        ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
            { Stmt_for_body8__TO__if_end[i0, i1, i2] -> MemRef_A[i0, i2] };
        ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
            { Stmt_for_body8__TO__if_end[i0, i1, i2] -> MemRef_B[i2, i1] };
        ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
            { Stmt_for_body8__TO__if_end[i0, i1, i2] -> MemRef_C[i0, i1] };
        MustWriteAccess :=	[Reduction Type: NONE] [Scalar: 1]
            { Stmt_for_body8__TO__if_end[i0, i1, i2] -> MemRef_storemerge__phi[] };
       new: { Stmt_for_body8__TO__if_end[i0, i1, i2] -> MemRef_C[i0, i1] };
	Stmt_if_end
        Domain :=
            { Stmt_if_end[i0, i1, i2] : 0 <= i0 <= 31 and 0 <= i1 <= 31 and 0 <= i2 <= 31 };
        Schedule :=
            { Stmt_if_end[i0, i1, i2] -> [i0, i1, i2, 1] };
        ReadAccess :=	[Reduction Type: NONE] [Scalar: 1]
            { Stmt_if_end[i0, i1, i2] -> MemRef_storemerge__phi[] };
       new: { Stmt_if_end[i0, i1, i2] -> MemRef_C[i0, i1] };
        MustWriteAccess :=	[Reduction Type: NONE] [Scalar: 0]
            { Stmt_if_end[i0, i1, i2] -> MemRef_C[i0, i1] };


MeinersburUnsubmitted

Not Done

The storemerge PHI node is introduced by one of the canonicalization passes (in this case: InstCombine). It is possible to not run that pass, disable the matching of this particular pattern, or use some other trick to not be matched. In any case, we cannot rely on InstCombine to happen. It might be safer to bail out if any RegionStmt is encountered.

gareevromanAuthorUnsubmitted

Done

I see. I haven't managed to fix that test case. So, I've decided to remove it. I've left the check that bails out if any RegionStmt is encountered in containsOnlyTCAcc.


/// one, are read memory accesses, which read elements of operands of the tensor

/// contraction and its result.

///

/// @param Domain The domain of the statement.

/// @param Stmt The statement, which writes to memory.

/// @param TCI The information about the tensor contraction.

/// @param IandJIndexSet The set, which contains free indexes of tensors.

MeinersburUnsubmitted

Done

Compiler warning:

/home/meinersbur/src/llvm-project/polly/lib/Transform/MatmulOptimizer.cpp:1310:13: warning: moving a temporary object prevents copy elision [-Wpessimizing-move]
    TCI.I = std::move(set_difference(IndexSet, TCI.P));

The result of set_difference is already an r-value, no need to cast it to an r-value.
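A small stand-alone illustration of why the warning fires. `Tracker` and the helper functions are hypothetical and only count move constructions; the counts mentioned below assume a compiler that elides the copy when initializing from a temporary (guaranteed since C++17). Compilers are expected to emit the same -Wpessimizing-move warning for the second function.

```cpp
#include <cassert>
#include <utility>

// Counts move constructions so that the effect is observable.
struct Tracker {
  static int Moves;
  Tracker() = default;
  Tracker(const Tracker &) = default;
  Tracker(Tracker &&) noexcept { ++Moves; }
};
int Tracker::Moves = 0;

// Returns a temporary, like set_difference does in the patch.
Tracker makeTracker() { return Tracker(); }

// Initializing directly from the call lets the compiler elide the copy.
int movesWithoutStdMove() {
  Tracker::Moves = 0;
  Tracker T = makeTracker();
  (void)T;
  return Tracker::Moves;
}

// Wrapping the call in std::move turns the temporary into an xvalue, which
// forces a move construction and prevents copy elision.
int movesWithStdMove() {
  Tracker::Moves = 0;
  Tracker T = std::move(makeTracker());
  (void)T;
  return Tracker::Moves;
}
```

Under the stated assumption, the first function performs no move construction and the second performs one. In an assignment such as the one flagged above, the move assignment operator is selected either way, so dropping std::move changes nothing except silencing the warning.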


/// @return True if all read memory accesses of the statement @p Stmt correspond

/// to the tensor contraction.

static bool setReadAccesses(isl::set Domain, ScopStmt *Stmt, TCInfoTy &TCI,

MeinersburUnsubmitted

Done

If any of the returns are executed, what causes the pattern to be rejected (it's not return false)?


SmallDenseSet<int> &IandJIndexSet) {

TCI.A = nullptr;

MeinersburUnsubmitted

Done

Same compiler warning.


TCI.B = nullptr;

TCI.ReadFromC = nullptr;

SmallVector<MemoryAccess *, 32> Accesses = getAccessesInOrder(*Stmt);

for (auto *MemA = Accesses.begin(); *MemA != TCI.WriteToC; MemA++) {

MemoryAccess *MemAccessPtr = *MemA;

if (!MemAccessPtr->isLatestArrayKind())

continue;

// All memory accesses, except from the last one, should be read memory

// accesses.

if (MemAccessPtr->isWrite())

return false;

// There is only one memory access, which reads elements of the result of

// the tensor contraction.

isl::map AccMap = MemAccessPtr->getLatestAccessRelation();

if (AccMap.is_equal(TCI.WriteToC->getLatestAccessRelation())) {

if (TCI.ReadFromC)

return false;

TCI.ReadFromC = MemAccessPtr;

continue;

}

SmallVector<int> Dimensions;

SmallDenseSet<int> IndexSet;

if (!isTCOperandAcc(Domain, AccMap, IndexSet, TCI.DimensionSizes,

Dimensions))

return false;

if (!setReadAccess(MemAccessPtr, IndexSet, IandJIndexSet, Dimensions, TCI))

return false;

}

// Check that there are read memory accesses, which read elements of operands

// of the tensor contraction and its result.

return TCI.ReadFromC && TCI.A && TCI.B;

}

/// Check accesses to operands of the tensor contraction.

///

/// Check that accesses of the SCoP statement, which corresponds to

/// the partial schedule @p PartialSchedule, represent accesses

/// to the non-scalar operands of the tensor contraction.

///

/// @param Domain The domain of the SCoP statement.

/// @param PartialSchedule The partial schedule of the SCoP statement.

/// @param TCI Parameters of the tensor contraction operands.

/// @return True if the corresponding SCoP statement

/// represents tensor contraction and false,

/// otherwise.

static bool containsOnlyTCAcc(isl::set Domain, isl::map PartialSchedule,

TCInfoTy &TCI) {

isl::id InputDimsId = PartialSchedule.get_tuple_id(isl::dim::in);

ScopStmt *Stmt = static_cast<ScopStmt *>(InputDimsId.get_user());

// In region statements, the order of memory accesses execution is not

// predictable at compile-time.

if ((Stmt->size() <= 1) || Stmt->isRegionStmt())

return false;

unsigned DimNum = unsignedFromIslSize(PartialSchedule.dim(isl::dim::in));

TCI.DimensionSizes.resize(DimNum);

SmallDenseSet<int> IandJIndexSet;

TCI.WriteToC = getWriteAccess(Domain, Stmt, TCI, IandJIndexSet);

if (!TCI.WriteToC)

return false;

MeinersburUnsubmitted

Done

/// and false, otherwise.

static bool isTcDep(isl::set DepDelta, unsigned Pos, isl::set BoundDeltas,

- std::set<int> *IndexSet) {

+ const std::set<int> &IndexSet) {

if ((unsignedFromIslSize(DepDelta.n_basic_set()) != 1) ||


if (intersect(IandJIndexSet, TCI.P).size() != 0)

return false;

MeinersburUnsubmitted

Done

The check should not depend on n_basic_set, which is fragile and depends on whether, e.g., coalesce was successful. Consider using something like polly::getConstant.

gareevromanAuthorUnsubmitted

Done

I think that this check was redundant. I’ve removed it.


if (!setReadAccesses(Domain, Stmt, TCI, IandJIndexSet))

return false;

return true;

}

/// Check that dependency corresponds to the tensor contraction.

MeinersburUnsubmitted

Done

return true;

}

- /// Check that dependency corresponds to the tensor contraction.

+ /// Check that dependency corresponds to the tensor contraction carried over loop dimension @p Pos.

///

/// Check that the dependency has the form


///

/// Check that the dependency has the form

/// S(..., ki, max(k(i + 1)), ..., max(kn), ...) ->

/// S(..., ki + 1, min(k(i + 1)), ..., min(kn), ...), where S is the SCoP

/// statement. For this purpose, we analyze the set @p DepDelta, which

/// represents the differences between image elements and domain elements of

/// the corresponding map.

///

/// @param DepDelta The set contains the differences between image elements

/// and corresponding domain elements of the map, which

/// represents the dependency.

/// @param Pos The position of the index ki.

/// @param BoundDeltas In the case of indexes of ki, the difference between

/// image elements and corresponding domain elements

/// corresponds to the difference between lexicographic

/// minimum and lexicographic maximum of the corresponding

/// dimension of the domain of the statement.

/// @param IndexSet Obtained indexes ki, which describe the dependency.

/// @return True if dependencies correspond to the tensor contraction

/// and false, otherwise.

static bool isTcDep(isl::set DepDelta, unsigned Pos, isl::set BoundDeltas,

MeinersburUnsubmitted

Done

This seems to determine whether there is a reduction (contraction) carried by loop number Pos. The function name could be more meaningful. Suggestion: isDepCarryingReductionOverDim (not nice, but "TcDep" can mean anything)

gareevromanAuthorUnsubmitted

Done

Could we name it isReductionCarriedOverDim? I think, in this case, we should rename the parameter Pos to Dim to make it more readable.


MeinersburUnsubmitted

Done

Sounds ok.


SmallDenseSet<int> *IndexSet) {

MeinersburUnsubmitted

Done

static bool isTcDep(isl::set DepDelta, unsigned Pos, isl::set BoundDeltas,

- SmallDenseSet<int> *IndexSet) {

+ const SmallDenseSet<int> &IndexSet) {

// Check the difference between the image element and the domain element

Consider passing by const-reference.


// Check the difference between the image element and the domain element

// in the case of the index ki.

if (!DepDelta.plain_get_val_if_fixed(isl::dim::set, Pos).is_one())

MeinersburUnsubmitted

Done

plain_get_val_if_fixed is not really robust, as it depends on the internal representation, which can be different after, e.g., simplify.

Since this just checks for a specific value, the best would be to create a new set where all the fixed dimensions have that value (here: 1), and check whether DepDelta is a subset of it.

return false;

// Image elements and corresponding domain elements should be equal in

// the case of positions, which are lower than the specified position.

for (unsigned j = 0; j < Pos; j++)

if (!DepDelta.plain_get_val_if_fixed(isl::dim::set, j).is_zero())

return false;

// Compute a set, which is used to analyze how values of

// the domain are related to the map that describes the dependency.

isl::map DepDeltaNegToBoundDeltas = isl::map::from_domain_and_range(

isl::manage(isl_set_neg(DepDelta.copy())), BoundDeltas);

isl::set Complement = DepDeltaNegToBoundDeltas.deltas();

MeinersburUnsubmitted

Done

Consider lexmin_pw_multi_aff/lexmax_pw_multi_aff


MeinersburUnsubmitted

Not Done

Why not BoundDeltas.subtract() instead of deltas?


gareevromanAuthorUnsubmitted

Done

As far as I understand, these operations are not equal.

deltas computes a set containing the differences between image elements and corresponding domain elements in the input. subtract computes a subtraction of sets.

For example, in the case of the following sets they compute the following:

BoundDeltas : {Stmt_for_body15[31, 31, 31, 31, 31, 31] }
isl::manage(isl_set_neg(DepDelta.copy())): {Stmt_for_body15[0, 0, 0, 0, 0, -1]}

BoundDeltas.subtract(isl::manage(isl_set_neg(DepDelta.copy()))) : {Stmt_for_body15[31, 31, 31, 31, 31, 31]}
deltas: {Stmt_for_body15[31, 31, 31, 31, 31, 32]}

BoundDeltas : {Stmt_for_body15[31, 31, 31, 31, 31, 31]}
isl::manage(isl_set_neg(DepDelta.copy())): {Stmt_for_body15[0, 0, 0, -1, 0, 31]}

BoundDeltas.subtract(isl::manage(isl_set_neg(DepDelta.copy()))) : {Stmt_for_body15[31, 31, 31, 31, 31, 31]}
deltas: {Stmt_for_body15[31, 31, 31, 32, 31, 0]}

This comment interferes with the comment about pw_multi_aff. Consequently, I replaced the usage of isl_map_deltas with operations on pw_multi_aff.
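To make the difference between the two operations concrete, here is a minimal sketch that does not use isl at all. It models each set as an explicit collection of integer points; `subtractSets` and `deltasOfProduct` are hypothetical helpers that only mimic set difference and the deltas of a map built from a domain and a range. They are assumptions for illustration, not isl functions.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

using Point = std::vector<int>;
using PointSet = std::set<Point>;

// Set difference: keeps the points of A that are not contained in B.
PointSet subtractSets(const PointSet &A, const PointSet &B) {
  PointSet Result;
  for (const Point &P : A)
    if (!B.count(P))
      Result.insert(P);
  return Result;
}

// Deltas of the map { d -> r : d in Dom, r in Ran }, i.e. all differences
// r - d between image elements and domain elements.
PointSet deltasOfProduct(const PointSet &Dom, const PointSet &Ran) {
  PointSet Result;
  for (const Point &D : Dom)
    for (const Point &R : Ran) {
      assert(D.size() == R.size() && "points must have the same dimension");
      Point Diff(D.size());
      for (std::size_t I = 0; I < D.size(); ++I)
        Diff[I] = R[I] - D[I];
      Result.insert(Diff);
    }
  return Result;
}
```

With the values quoted above, `subtractSets` leaves BoundDeltas unchanged because the two sets share no point, while `deltasOfProduct` applied to the negated DepDelta and BoundDeltas adds DepDelta to BoundDeltas component by component.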


const unsigned DimNum = unsignedFromIslSize(DepDelta.dim(isl::dim::set));

for (unsigned j = Pos + 1; j < DimNum; j++) {

if (!IndexSet->count(j)) {

// Check the difference between the image element and the domain element

// in the case of indexes, which do not describe the dependency.

if (DepDelta.plain_get_val_if_fixed(isl::dim::set, j).is_zero())

continue;

return false;

}

// In the case of other indexes, which describe the dependency,

// the difference between the image element and the domain element

// should be equal to the difference between lexicographic minimum and

// lexicographic maximum of the domain of the statement.

if (!Complement.plain_get_val_if_fixed(isl::dim::set, j).is_zero())

return false;

}

return true;

}
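As a reading aid, the condition tested by the function above can be sketched on a single dependence distance vector, without isl. The sketch assumes a rectangular domain and takes the per-dimension extent (maximum minus minimum) directly; `isReductionCarriedOverDimSketch` and its parameters are hypothetical names chosen for illustration, not part of the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Checks whether the distance vector Delta has the form produced by a
// reduction carried by loop dimension Dim:
//  - zero in every dimension before Dim,
//  - one in dimension Dim,
//  - in later dimensions already known to be contracted, a wrap-around from
//    the maximum to the minimum, i.e. -(max - min) = -Extent[J],
//  - zero in all other later dimensions.
bool isReductionCarriedOverDimSketch(const std::vector<int> &Delta,
                                     std::size_t Dim,
                                     const std::vector<int> &Extent,
                                     const std::set<std::size_t> &Contracted) {
  if (Dim >= Delta.size() || Delta.size() != Extent.size())
    return false;
  for (std::size_t J = 0; J < Dim; ++J)
    if (Delta[J] != 0)
      return false;
  if (Delta[Dim] != 1)
    return false;
  for (std::size_t J = Dim + 1; J < Delta.size(); ++J) {
    int Expected = Contracted.count(J) ? -Extent[J] : 0;
    if (Delta[J] != Expected)
      return false;
  }
  return true;
}
```

For a six-dimensional domain with extent 31 in every dimension, the distance vector [0, 0, 0, 1, 0, -31] is accepted for dimension 3 once dimension 5 is known to be contracted, which corresponds to the second example discussed in the review thread.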

/// Check that dependencies correspond to the tensor contraction.

///

/// Check that there are only true dependencies of the form

/// S(..., ki, max(k(i + 1)), ..., max(kn), ...) ->

/// S(..., ki + 1, min(k(i + 1)), ..., min(kn), ...), where S is the SCoP

/// statement represented by @p Schedule. Such dependencies are produced by

/// the tensor contraction. Obtained indexes ki are stored into @p IndexSet.

///

/// @param Schedule The schedule of the SCoP statement.

/// @param D The SCoP dependencies.

/// @param Domain The domain of the statement.

/// @param IndexSet Obtained indexes ki, which describe the dependencies.

/// @return True if dependencies correspond to the tensor contraction

/// and false, otherwise.

static bool containsOnlyTcDeps(isl::map Schedule, const Dependences *D,

SmallDenseSet<int> *IndexSet, isl::set Domain) {

isl::union_map Dep = D->getDependences(Dependences::TYPE_RAW);

isl::union_map Red = D->getDependences(Dependences::TYPE_RED);

if (!Red.is_null())

Dep = Dep.unite(Red);

MeinersburUnsubmitted

Not Done

SmallDenseSet<int> *IndexSet, isl::set Domain) {

- isl::union_map Dep = D->getDependences(Dependences::TYPE_RAW);

- isl::union_map Red = D->getDependences(Dependences::TYPE_RED);

- if (!Red.is_null())

- Dep = Dep.unite(Red);

+ isl::union_map Dep = D->getDependences(Dependences::TYPE_RAW | Dependences::TYPE_RED);

isl::space DomainSpace = Schedule.get_space().domain();

Should we also check whether WAW, RAW dependences are incompatible?


gareevromanAuthorUnsubmitted

Done

As far as I understand, that is not necessary, because subsequently we check that the statement has the form C(shuffle(I, J)) = E(A(shuffle(I, P)), B(shuffle(P, J)), C(shuffle(I, J))), where E is an expression that contains reads from the tensors A, B, C, and an arbitrary number of reads from constants with respect to bundles I, J, and P.

I have added a comment that describes this.

"The form of anti and output dependencies is specified by the form of the SCoP statement, which is checked by subsequent analysis."

isl::space DomainSpace = Schedule.get_space().domain();

isl::space Space = DomainSpace.map_from_domain_and_range(DomainSpace);

isl::set DepDeltas = Dep.extract_map(Space).deltas();

unsigned DeltasDimNum = unsignedFromIslSize(DepDeltas.dim(isl::dim::set));

isl::pw_multi_aff LowerBound = Domain.lexmin_pw_multi_aff();

isl::pw_multi_aff BoundSub = Domain.lexmax_pw_multi_aff().sub(LowerBound);

MeinersburUnsubmitted

Not Done

lexmin/lexmax can be expensive. Wrap into an IslMaxOperationsGuard?

gareevromanAuthorUnsubmitted

Done

What is the maximum number of computational steps we should use by default? I set it to 500000 according to DependenceInfo.cpp.

auto BoundDeltas = isl::manage(isl_set_from_pw_multi_aff(BoundSub.release()));

MeinersburUnsubmitted

Done

You seem to assume a functional relationship from here on. If that's the case, you can keep the type a pw_multi_aff, which supports more functions that you may have missed, such as pw_multi_aff::add

for (int i = static_cast<int>(DeltasDimNum) - 1; i >= 0; i--) {

MeinersburUnsubmitted

Done

Consider using reverse(rangeIslSize(0,DeltasDimNum)) (from ISLTools.h).


// In the case of the tensor contraction, the difference between image

// elements and domain elements lies on a hyperplane where a dimension

// has the fixed value one.

isl::set Intersection = DepDeltas.fix_si(isl::dim::set, i, 1);

if (Intersection.is_empty())

continue;

MeinersburUnsubmitted

Not Done

This is going to check whether each element out of Intersection is a contraction over dimension i. Don't we also need to check that every iteration out of the band i is contributing to that contraction?


gareevromanAuthorUnsubmitted

Done

Could you clarify what you mean by the band i? Are these indexes ki, which describe the dependencies?

isTcDep checks that the dependency has the form

/// S(..., ki, max(k(i + 1)), ..., max(kn), ...) -> S(..., ki + 1, min(k(i + 1)), ..., min(kn), …)


MeinersburUnsubmitted

Not Done

There is a check Intersection.is_empty() which is going to detect if a dependency is completely missing. But what detects that only some of the dependencies are present? For example:

[p] -> { Stmt3[i0, i1, i2, i3] -> Stmt3[o0, o1, o2, o3] : .... and p != 0 }

{ Stmt3[i0, i1, i2, i3] -> Stmt3[o0, o1, o2, o3] : .... and i0 < 42 }

(assuming contracting over i=0)

isReductionCarriedOverDim doesn't seem to check whether the dependency is over the complete domain either.


gareevromanAuthorUnsubmitted

Not Done

Are dependencies determined by the form of memory accesses? In the isCorrectAccessMap function, we check that memory accesses aren't partial. Isn't it sufficient?

I've tried to check whether the dependency is over the complete domain though.


if (!isTcDep(Intersection, i, BoundDeltas, IndexSet))

return false;

IndexSet->insert(i);

DepDeltas = DepDeltas.subtract(Intersection);

}

// In the case of the tensor contraction, all dependencies should have

// the previously described form.

if ((DeltasDimNum == 0) || !DepDeltas.is_empty())

return false;

return true;

}
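The loop above peels off the dependence distances from the innermost to the outermost dimension. A minimal sketch of that idea on explicit distance vectors follows; it assumes a rectangular domain described by per-dimension extents (maximum minus minimum), and `findContractedDims` is a hypothetical name used only for this illustration. It does not address the question raised in the review of whether the dependences cover the complete domain.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

using Delta = std::vector<int>;

// Returns true if every distance vector in Deltas has the reduction form
// described above; the contracted dimensions are collected in P.
bool findContractedDims(std::set<Delta> Deltas, const std::vector<int> &Extent,
                        std::set<std::size_t> &P) {
  const std::size_t N = Extent.size();
  if (N == 0)
    return false;
  for (std::size_t I = N; I-- > 0;) {
    // Distance vectors lying on the hyperplane where component I equals one.
    std::set<Delta> Carried;
    for (const Delta &D : Deltas)
      if (D.size() == N && D[I] == 1)
        Carried.insert(D);
    if (Carried.empty())
      continue;
    for (const Delta &D : Carried)
      for (std::size_t J = 0; J < N; ++J) {
        int Expected = 0;
        if (J == I)
          Expected = 1;
        else if (J > I && P.count(J))
          Expected = -Extent[J]; // wrap-around of an inner contracted loop
        if (D[J] != Expected)
          return false;
      }
    P.insert(I);
    for (const Delta &D : Carried)
      Deltas.erase(D);
  }
  // Every distance vector must have been explained by some dimension.
  return Deltas.empty();
}
```

For a loop nest with two outer contracted loops of 64 iterations each, the distance vectors [0, 1, 0, 0] and [1, -63, 0, 0] lead to P = {0, 1}.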

/// Order the dimensions of operands of the tensor contraction.

///

/// To improve the spatial locality of the data, we use a heuristic, which is

/// introduced in [1]. The heuristic logically reorders the tensor dimensions

/// and consists of the following steps:

///

/// 1. Sort the dimensions in I and J by increasing stride in the tensor C.

/// 2. If tensor A contains the last dimension of C, swap A with B and I with J.

/// 3. Sort the dimensions in P by increasing stride in the tensor A.

///

/// [1] - Devin Matthews. High-Performance Tensor Contraction without BLAS.

/// SIAM Journal on Scientific Computing, 40(1). 2016. DOI: 10.1137/16M108968X.

///

/// @param TCI The information about tensors.

static void orderDimensions(TCInfoTy &TCI) {

// Sort the dimensions in I and J by increasing stride in the tensor C.

for (const int &Dimension : TCI.CDimensions) {

assert(!(TCI.I.count(Dimension) && TCI.J.count(Dimension)));

if (TCI.I.count(Dimension)) {

TCI.OrderedI.push_back(Dimension);

continue;

}

if (TCI.J.count(Dimension))

TCI.OrderedJ.push_back(Dimension);

}

// If tensor A contains the last dimension of C, swap A with B and I with J.

if (TCI.I.count(TCI.CDimensions.back())) {

MemoryAccess *Access = TCI.A;

TCI.A = TCI.B;

TCI.B = Access;

TCI.I.swap(TCI.J);

TCI.ADimensions.swap(TCI.BDimensions);

TCI.OrderedI.swap(TCI.OrderedJ);

}

// Sort the dimensions in P by increasing stride in the tensor A.

for (const int &Dimension : TCI.ADimensions)

if (TCI.P.count(Dimension))

TCI.OrderedP.push_back(Dimension);

}

/// Check if the SCoP statement could probably be optimized with analytical

/// modeling.

///

/// containsTC tries to determine whether the following conditions

/// are true:

///

/// 1. The last memory access modeling an array, MA1, represents writing to

/// memory and has the form S(..., I, ..., J, ...) -> M(shuffle(I, J)),

/// where S is the SCoP statement under consideration and shuffle(I, J)

/// is a permutation of indexes of sets I and J.

/// 2. There are only true dependencies of the form

/// S(..., ki, max(k(i + 1)), ..., max(kn), ...) ->

/// S(..., ki + 1, min(k(i + 1)), ..., min(kn), ...), where S is the SCoP

/// statement represented by @p Schedule and ki are indexes of the set P.

/// 3. SCoP contains only three access relations, MA2, MA3, and MA4 that

/// represent reading from memory and have the form

/// S(..., I, ..., P, ...) -> M(shuffle(I, P)),

/// S(..., P, ..., J, ...) -> M(shuffle(J, P)),

/// S(...) -> M(shuffle(I, J)), respectively.

///

/// @param PartialSchedule The PartialSchedule that contains a SCoP statement

/// to check.

/// @param D The SCoP dependencies.

/// @param TCI Parameters of the tensor contraction operands.

/// @param Domain The domain of the statement.

/// @return True if dependencies and memory accesses correspond to the tensor

/// contraction and false, otherwise.

static bool containsTCInfoTy(isl::map PartialSchedule, const Dependences *D,

TCInfoTy &TCI, isl::set Domain) {

if (!containsOnlyTcDeps(PartialSchedule, D, &TCI.P, Domain))

return false;

// TODO: handle cases of scalar multiplication if needed.

if (TCI.P.size() == 0)

return false;

if (!containsOnlyTCAcc(Domain, PartialSchedule, TCI))

return false;

// TODO: handle cases of GEMV if needed.

if ((TCI.I.size() == 0) || (TCI.J.size() == 0))

return false;

orderDimensions(TCI);

MeinersburUnsubmitted

Done

This is not part of the pattern detection, but the optimization. Could we move it to the patch that does the actual optimization?


return true;

}

/// Check if this node contains a partial schedule that could

/// probably be optimized with analytical modeling.

///

/// isTCPattern tries to determine whether the following conditions

/// are true:

/// 1. the partial schedule contains only one statement.

/// 2. there are exactly three input dimensions.

/// 3. there are four memory accesses that represent accesses to tensor

/// contraction operand.

MeinersburUnsubmitted

Done

Could you describe here what those 4 accesses are?


/// 4. all memory accesses of the statement except from the last one, are

/// read memory accesses and the last one is a write memory access.

/// 5. all subscripts of the last memory access of the statement do not

/// contain the variable used in the inner loop.

MeinersburUnsubmitted

Done

Condition 5 should not be necessary; any permutation of the surrounding loops can be valid. Eg,

for (w = 0; w < 64; ++w)
  for (l = 0; l < 64; ++l)
    for (i = 0; i < 1024; i++)
      for (j = 0; j < 1024; j++)
           C[i][j] += A[i][l][w] * B[w][j][l];

yields the same result.
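A small self-contained check of this claim, as a sketch: the extents and the integer-valued stand-ins `elemA` and `elemB` for the tensors are assumptions chosen for illustration. Integer arithmetic is used so that the comparison is exact; with floating-point data, reordering the reduction may change rounding.

```cpp
#include <cassert>
#include <vector>

// Small, arbitrary extents (hypothetical values for the illustration).
constexpr int NI = 4, NJ = 5, NL = 3, NW = 2;
using Matrix = std::vector<std::vector<long>>;

// Deterministic integer stand-ins for A[i][l][w] and B[w][j][l].
long elemA(int I, int L, int W) { return I + 2 * L + 3 * W + 1; }
long elemB(int W, int J, int L) { return 2 * W - J + L; }

// Loop order w, l, i, j: the contracted dimensions are the outer loops.
Matrix contractOuterWL() {
  Matrix C(NI, std::vector<long>(NJ, 0));
  for (int W = 0; W < NW; ++W)
    for (int L = 0; L < NL; ++L)
      for (int I = 0; I < NI; ++I)
        for (int J = 0; J < NJ; ++J)
          C[I][J] += elemA(I, L, W) * elemB(W, J, L);
  return C;
}

// Loop order i, j, l, w: the contracted dimensions are the inner loops.
Matrix contractInnerLW() {
  Matrix C(NI, std::vector<long>(NJ, 0));
  for (int I = 0; I < NI; ++I)
    for (int J = 0; J < NJ; ++J)
      for (int L = 0; L < NL; ++L)
        for (int W = 0; W < NW; ++W)
          C[I][J] += elemA(I, L, W) * elemB(W, J, L);
  return C;
}
```

Both functions compute the same contraction, so comparing their results with `==` is expected to succeed for any choice of extents.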


///

MeinersburUnsubmitted

Done

Could you add a high-level description how the algorithm actually works? I.e. dependencies used to determine contraction dimensions, etc.


/// If this is the case, we could use an approach that is similar to

/// the one used to get close-to-peak performance of matrix multiplications.

///

/// @param Node The node to check.

/// @param D The SCoP dependencies.

/// @param TCI Parameters of the tensor contraction operands.

static bool isTCPattern(isl::schedule_node Node, const Dependences *D,

TCInfoTy &TCI) {

Node = Node.child(0);

auto LeafType = isl_schedule_node_get_type(Node.get());

MeinersburUnsubmitted

Not Done

Prefer Node.isa<isl::schedule_node_leaf>() (and then typed subclass: Node.as<isl::schedule_node_leaf>())

gareevromanAuthorUnsubmitted

Done

Could we factor out this condition into ScheduleTreeOptimizer::isPMOptimizableBandNode, since it is common to the isTCPattern and isMatrMultPattern functions? A new version of the patch shows what it could look like.

isl::union_map PartialSchedule = Node.get_prefix_schedule_union_map();

isl::union_set Domain = Node.domain();

Node = Node.parent();

// The innermost band node is expected.

if (LeafType != isl_schedule_node_leaf ||

isl_union_map_n_map(PartialSchedule.get()) != 1)

return false;

MeinersburUnsubmitted

Done

This condition is effectively identical to the next


// The partial schedule should contain only one statement.

if (isl_union_set_n_set(Domain.get()) != 1)

MeinersburUnsubmitted

Done

This constraint should not be intrinsic to the algorithm, but I agree it to be easier to handle for now.


gareevromanAuthorUnsubmitted

Done

Could we add a TODO comment for this?


MeinersburUnsubmitted

Done

Yes, that would be great.


gareevromanAuthorUnsubmitted

Done

Ok. I've left that TODO comment.


return false;

MeinersburUnsubmitted

Done

/// 3. SCoP contains an arbitrary number of reads from constants and only three

- /// access relations, MA2, MA3, and MA4 that epresent reading from memory

+ /// access relations, MA2, MA3, and MA4 that represent reading from memory

/// and have the form

[typo]


auto HasFilterNode = false;

auto NodeType = isl_schedule_node_get_type(Node.get());

MeinersburUnsubmitted

Done

[style] No Almost-Always-Auto in LLVM's coding style.

Meinersbur: [style] [[ https://llvm.org/docs/CodingStandards.html#use-auto-type-deduction-to-make-code-more…

// Check that all predecessors of the node contain all band nodes

MeinersburUnsubmitted

Done

auto NodeType = isl_schedule_node_get_type(Node.get());

- // Check that all predecessors of the node contain all band nodes

+ // Check that all ancestors of the node contain all band nodes

// for the statement, which represents the tensor contraction.


gareevromanAuthorUnsubmitted

Done

Looks like I missed that. Sorry. I will fix it in the next patch.


// for the statement, which represents the tensor contraction.

// Subsequently, such band nodes will be replaced by a single band node.

while (NodeType != isl_schedule_node_domain) {

MeinersburUnsubmitted

Not Done

This looks for the outermost node that is not a filter or band. Is it possible that, while that outermost node is not a TC contraction, one of the inner ones might be? What if the outermost node is a filter? It looks like it would just return false in this case.

gareevromanAuthorUnsubmitted

Done

If I am not mistaken, this only checks that all band nodes, which represent the statement, are not split by filter nodes. This accepts a straightforward implementation of TC with/without delicm. For example,

domain: "{ Stmt_for_body8[i0, i1, i2] : 0 <= i0 <= 1599 and

                              0 <= i1 <= 1799 and
                              0 <= i2 <= 2199;
Stmt_for_body3[i0, i1] :      0 <= i0 <= 1599 and
                              0 <= i1 <= 1799;
Stmt_for_body3_last[i0, i1] : 0 <= i0 <= 1599 and
                              0 <= i1 <= 1799 }"

child:

sequence:
- filter: "{ Stmt_for_body3[i0, i1] }"
  child:
    schedule: "[{ Stmt_for_body3[i0, i1] -> [(i0)] }, { Stmt_for_body3[i0, i1] -> [(i1)] }]"
    permutable: 1
    coincident: [ 1, 1 ]
- filter: "{ Stmt_for_body3_last[i0, i1] }"
  child:
    schedule: "[{ Stmt_for_body3_last[i0, i1] -> [(i0)] }, { Stmt_for_body3_last[i0, i1] -> [(i1)] }]"
    permutable: 1
    coincident: [ 1, 1 ]
- filter: "{ Stmt_for_body8[i0, i1, i2] }"
  child:
    schedule: "[{ Stmt_for_body8[i0, i1, i2] -> [(i0)] },
                { Stmt_for_body8[i0, i1, i2] -> [(i1)] },
                { Stmt_for_body8[i0, i1, i2] -> [(i2)] }]"
    permutable: 1
    coincident: [ 1, 1, 0 ]

domain: "{ Stmt2[i0, i1, i2] : 0 <= i0 <= 31 and 0 <= i1 <= 31 and 0 <= i2 <= 31 }"
child:

schedule: "[{ Stmt2[i0, i1, i2] -> [(i0)] }, { Stmt2[i0, i1, i2] -> [(i1)] }, { Stmt2[i0, i1, i2] -> [(i2)] }]"
permutable: 1
coincident: [ 1, 1, 0 ]

Sorry, I have not committed an updated version of the optimization of TC to my github repo. However, I believe that, if this is the case, we can safely replace all such nodes.

+ auto NodeType = isl_schedule_node_get_type(Node.get());
+ while ((NodeType != isl_schedule_node_domain) &&
+ (NodeType != isl_schedule_node_filter)) {
+ assert((NodeType != isl_schedule_node_sequence) &&
+ L"Prevent the undefined behavior");
+ Node = Node.parent();
+ NodeType = isl_schedule_node_get_type(Node.get());
+ }
+ Node = Node.child(0);
+ Node = isl::manage(isl_schedule_node_cut(Node.release()));
+ return Node.insert_partial_schedule(Dimensions);

I think that the detection of more sophisticated implementations of TC is a possible goal of future research.

I have described this in the comment.


MeinersburUnsubmitted

Done

I think some info in the comment like "all surrounding band nodes are assumed to be part of the TC and must not be interleaved by filter nodes."

Since it is not checking for it, it seems to imply that all other nodes types are OK? (sequence, set, expansion, extension, marker). Maybe reject them too? (I think ignoring marker nodes might still be ok)


gareevromanAuthorUnsubmitted

Done

> I think some info in the comment like "all surrounding band nodes are assumed to be part of the TC and must not be interleaved by filter nodes."

I've added it.

> Since it is not checking for it, it seems to imply that all other nodes types are OK? (sequence, set, expansion, extension, marker). Maybe reject them too? (I think ignoring marker nodes might still be ok)

Sequence nodes could be necessary if DeLICM was applied. Please see the example inside isTCPattern. Yes, I think other types except for marker nodes should be rejected. Additionally, as a precaution, I propose to check that a filter node has only a sequence node and a domain node as its predecessors. I've updated the patch.

if (HasFilterNode && (NodeType == isl_schedule_node_band))

return false;

if (NodeType == isl_schedule_node_filter)

HasFilterNode = true;

Node = Node.parent();

NodeType = isl_schedule_node_get_type(Node.get());

}

isl::map PartialScheduleMap = isl::map::from_union_map(PartialSchedule);

if (containsTCInfoTy(PartialScheduleMap, D, TCI, isl::set(Domain)))

return true;

return false;

}
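The ancestor walk in isTCPattern can be sketched on a plain list of node kinds, without isl. The sketch below assumes the ancestors are listed from the innermost band node upward to the root; `NodeKind` and `bandsNotSplitByFilter` are hypothetical names for this illustration. It mirrors only the version of the loop shown above and does not reject the additional node types discussed in the review.

```cpp
#include <cassert>
#include <vector>

enum class NodeKind { Domain, Sequence, Filter, Band, Marker };

// Walks from the innermost band node toward the root and rejects the tree if
// a band node appears above a filter node, i.e. if the band nodes of the
// statement are split by a filter node.
bool bandsNotSplitByFilter(const std::vector<NodeKind> &Ancestors) {
  bool HasFilterNode = false;
  for (NodeKind Kind : Ancestors) {
    if (Kind == NodeKind::Domain)
      return true; // reached the root without finding a split
    if (HasFilterNode && Kind == NodeKind::Band)
      return false;
    if (Kind == NodeKind::Filter)
      HasFilterNode = true;
  }
  // No domain node was found; treat the input as malformed (an assumption of
  // this sketch, the original loop expects a domain node to exist).
  return false;
}
```

The schedule tree produced after DeLICM in the example from the review corresponds to the sequence band, filter, sequence, domain, which is accepted.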

} // namespace

isl::schedule_node

polly::tryOptimizeMatMulPattern(isl::schedule_node Node,

const llvm::TargetTransformInfo *TTI,

const Dependences *D) {

+ TCInfoTy TCI;

+ if (PMBasedTCOpts && isTCPattern(Node, D, TCI))

+ LLVM_DEBUG(dbgs() << "The tensor contraction pattern was detected\n");

MatMulInfoTy MMI;

- if (isMatrMultPattern(Node, D, MMI)) {

+ if (PMBasedMMMOpts && isMatrMultPattern(Node, D, MMI)) {

LLVM_DEBUG(dbgs() << "The matrix multiplication pattern was detected\n");

return optimizeMatMulPattern(Node, TTI, MMI);

}

return {};

}

MeinersburUnsubmitted

Done

What is Goto here? GotoBLAS?


polly/lib/Transform/ScheduleOptimizer.cpp

[... 296 lines not shown ...]
private:
/// Check if this node is a band node we want to tile.
///
/// We look for innermost band nodes where individual dimensions are marked as
/// permutable.
///
/// @param Node The node to check.
static bool isTileableBandNode(isl::schedule_node Node);

+ /// Check if this node is a band node we want to transform using pattern
+ /// matching.
+ ///
+ /// We look for innermost band nodes where individual dimensions are marked as
+ /// permutable. There is no restriction on the number of individual
+ /// dimensions.
+ ///
+ /// @param Node The node to check.
+ static bool isPMOptimizableBandNode(isl::schedule_node Node);
+
/// Pre-vectorizes one scheduling dimension of a schedule band.
///
/// prevectSchedBand splits out the dimension DimToVectorize, tiles it and
/// sinks the resulting point loop.
///
/// Example (DimToVectorize=0, VectorWidth=4):
///
/// | Before transformation:
[... 128 lines not shown ...]
if (isl_schedule_node_get_type(Child.get()) != isl_schedule_node_filter)
return false;
if (isl_schedule_node_get_type(Child.child(0).get()) !=
isl_schedule_node_leaf)
return false;
}
return true;
}

- bool ScheduleTreeOptimizer::isTileableBandNode(isl::schedule_node Node) {
+ /// Check if this node is a permutable band node, which has only one child.
+ ///
+ /// @param Node The node to check.
+ static bool isPermutableOneTimeParentBandNode(isl::schedule_node Node) {
if (isl_schedule_node_get_type(Node.get()) != isl_schedule_node_band)
return false;

if (isl_schedule_node_n_children(Node.get()) != 1)
return false;

if (!isl_schedule_node_band_get_permutable(Node.get()))
return false;

+ return true;
+ }
+
+ bool ScheduleTreeOptimizer::isTileableBandNode(isl::schedule_node Node) {
+ if (!isPermutableOneTimeParentBandNode(Node))
+ return false;
+
auto Space = isl::manage(isl_schedule_node_band_get_space(Node.get()));

if (unsignedFromIslSize(Space.dim(isl::dim::set)) <= 1u)
return false;

MeinersburUnsubmitted

Done

Instead of modifying the idea of whether a node is tileable, consider introducing another constraint-checking function, as we should have done with prevectorization as well.

return isSimpleInnermostBand(Node);
}

+ bool ScheduleTreeOptimizer::isPMOptimizableBandNode(isl::schedule_node Node) {
+ if (!isPermutableOneTimeParentBandNode(Node))
+ return false;
+
+ return isSimpleInnermostBand(Node);
+ }

__isl_give isl::schedule_node		__isl_give isl::schedule_node
ScheduleTreeOptimizer::applyTileBandOpt(isl::schedule_node Node) {		ScheduleTreeOptimizer::applyTileBandOpt(isl::schedule_node Node) {
if (FirstLevelTiling) {		if (FirstLevelTiling) {
Node = tileNode(Node, "1st level tiling", FirstLevelTileSizes,		Node = tileNode(Node, "1st level tiling", FirstLevelTileSizes,
FirstLevelDefaultTileSize);		FirstLevelDefaultTileSize);
FirstLevelTileOpts++;		FirstLevelTileOpts++;
}		}

Show All 29 Lines
__isl_give isl_schedule_node *		__isl_give isl_schedule_node *
ScheduleTreeOptimizer::optimizeBand(__isl_take isl_schedule_node *NodeArg,		ScheduleTreeOptimizer::optimizeBand(__isl_take isl_schedule_node *NodeArg,
void *User) {		void *User) {
const OptimizerAdditionalInfoTy *OAI =		const OptimizerAdditionalInfoTy *OAI =
static_cast<const OptimizerAdditionalInfoTy *>(User);		static_cast<const OptimizerAdditionalInfoTy *>(User);
assert(OAI && "Expecting optimization options");		assert(OAI && "Expecting optimization options");

isl::schedule_node Node = isl::manage(NodeArg);		isl::schedule_node Node = isl::manage(NodeArg);
if (!isTileableBandNode(Node))
return Node.release();

if (OAI->PatternOpts) {		if (OAI->PatternOpts && isPMOptimizableBandNode(Node)) {
isl::schedule_node PatternOptimizedSchedule =		isl::schedule_node PatternOptimizedSchedule =
tryOptimizeMatMulPattern(Node, OAI->TTI, OAI->D);		tryOptimizeMatMulPattern(Node, OAI->TTI, OAI->D);
if (!PatternOptimizedSchedule.is_null()) {		if (!PatternOptimizedSchedule.is_null()) {
MatMulOpts++;		MatMulOpts++;
return PatternOptimizedSchedule.release();		return PatternOptimizedSchedule.release();
}		}
}		}

		if (!isTileableBandNode(Node))
		return Node.release();

if (OAI->Postopts)		if (OAI->Postopts)
Node = applyTileBandOpt(Node);		Node = applyTileBandOpt(Node);

if (OAI->Prevect) {		if (OAI->Prevect) {
// FIXME: Prevectorization requirements are different from those checked by		// FIXME: Prevectorization requirements are different from those checked by
// isTileableBandNode.		// isTileableBandNode.
Node = applyPrevectBandOpt(Node);		Node = applyPrevectBandOpt(Node);
}		}

polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm.ll

; RUN: opt %loadPolly \
; RUN: -polly-pattern-matching-based-opts=true \
; RUN: -polly-optree -polly-delicm -polly-simplify \
- ; RUN: -polly-opt-isl -debug < %s 2>&1 \
+ ; RUN: -polly-opt-isl -polly-tc-opt=true -debug < %s 2>&1 \
; RUN: | FileCheck %s
; REQUIRES: asserts
; Check that the pattern matching detects the matrix multiplication pattern
; after a full run of -polly-optree and -polly-delicm, where the write access
; is not through the original memory access, but trough a PHI node that was
; delicmed. This test covers the polybench 2mm and 3mm cases.
;
; This test case generates the following schedule, which contans filters:

Meinersbur (inline comment, Done), suggested edit:

- ; This test case generates the following schedule, which contans filters:
+ ; This test case generates the following schedule, which contains filters:

;
;    domain: "{ Stmt_for_body8[i0, i1, i2] : 0 <= i0 <= 1599 and
;                                            0 <= i1 <= 1799 and
;                                            0 <= i2 <= 2199;
;               Stmt_for_body3[i0, i1] : 0 <= i0 <= 1599 and
;                                        0 <= i1 <= 1799;
;               Stmt_for_body3_last[i0, i1] : 0 <= i0 <= 1599 and
;                                             0 <= i1 <= 1799 }"
;    child:
;      sequence:
;      - filter: "{ Stmt_for_body3[i0, i1] }"
;        child:
;          schedule: "[{ Stmt_for_body3[i0, i1] -> [(i0)] }, { Stmt_for_body3[i0, i1] -> [(i1)] }]"
;          permutable: 1
;          coincident: [ 1, 1 ]
;      - filter: "{ Stmt_for_body3_last[i0, i1] }"
;        child:
;          schedule: "[{ Stmt_for_body3_last[i0, i1] -> [(i0)] }, { Stmt_for_body3_last[i0, i1] -> [(i1)] }]"
;          permutable: 1
;          coincident: [ 1, 1 ]
;      - filter: "{ Stmt_for_body8[i0, i1, i2] }"
;        child:
;          schedule: "[{ Stmt_for_body8[i0, i1, i2] -> [(i0)] },
;                      { Stmt_for_body8[i0, i1, i2] -> [(i1)] },
;                      { Stmt_for_body8[i0, i1, i2] -> [(i2)] }]"
;          permutable: 1
;          coincident: [ 1, 1, 0 ]
;
; CHECK: The tensor contraction pattern was detected
; CHECK: The matrix multiplication pattern was detected
;
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

; Function Attrs: norecurse nounwind uwtable
define void @kernel_2mm(i32 %ni, i32 %nj, i32 %nk, i32 %nl, double %alpha, double %beta, [1800 x double]* nocapture %tmp, [2200 x double]* nocapture readonly %A, [1800 x double]* nocapture readonly %B, [2400 x double]* nocapture readnone %C, [2400 x double]* nocapture readnone %D) local_unnamed_addr #0 {


polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm_2.ll

This file was added.

; RUN: opt %loadPolly -polly-delicm -polly-simplify -polly-opt-isl \
; RUN: -polly-pattern-matching-based-opts=true \
; RUN: -polly-tc-opt=true -debug < %s 2>&1 | FileCheck %s

Meinersbur (inline comment, Not Done): Since this is not FileCheck-ing the LLVM-IR output, suppress it with `-disable-output`. Suggested edit:

- ; RUN: -polly-tc-opt=true -debug < %s 2>&1 | FileCheck %s
+ ; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s

gareevroman (author, Done): Could we fix the existing test cases in a separate patch?

polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm_2.ll

; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
; REQUIRES: asserts

polly/test/ScheduleOptimizer/pattern-matching-based-opts_16.ll