This is an archive of the discontinued LLVM Phabricator instance.

Differential D114336

[Polly] Generalize the pattern matching to the case of tensor contractions.
ClosedPublic

Authored by gareevroman on Nov 21 2021, 6:10 AM.

Details

Reviewers

Meinersbur
bollu

Commits

rGb02c7e2b630a: [Polly] Generalize the pattern matching to the case of tensor contractions

Summary

The pattern matching optimization of Polly detects and optimizes dense general matrix-matrix multiplication. The generated code is close to high performance implementations of matrix-matrix multiplications, which are contained in manually tuned libraries [1]. The described pattern matching optimization is a particular case of tensor contraction optimization, which was introduced in [2].

This patch generalizes the pattern matching to the case of tensor contractions using the algorithm described in [2]. Following the ideas introduced in [3], it logically represents tensor contraction operands as matrix multiplication operands and uses the approach presented in [1].

Optimization of tensor contractions will be added in the next patch. These modifications can be found in https://github.com/gareevroman/llvm-project.

[1] - Low T.M., Igual F.D., Smith T.M., Quintana-Orti E.S. Analytical Modeling Is Enough for High-Performance BLIS // ACM Transactions on Mathematical Software. 2016. Vol. 43, no. 2. P. 12:1–12:18. DOI: 10.1145/2925987.

[2] - Gareev R., Grosser T., Kruse M. High-Performance Generalized Tensor Operations: A Compiler-Oriented Approach // ACM Transactions on Architecture and Code Optimization (TACO). 2018. Vol. 15, no. 3. P. 34:1–34:27. DOI: 10.1145/3235029.

[3] - Matthews D. High-Performance Tensor Contraction without BLAS // SIAM Journal on Scientific Computing. 2018. Vol. 40, no. 1. P. C1–C24. DOI: 10.1137/16m108968x.
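
The idea in the summary of logically treating tensor contraction operands as matrix multiplication operands can be illustrated with a small sketch. It uses the kernel C[i][j] += A[i][l][w] * B[w][j][l] that is discussed later in this review, and fuses the contracted indices l and w into a single index p = l * NW + w. The extents, the flat row-major layout, and all function names are assumptions made for illustration; this is not Polly's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical small extents; the examples in this review use 1024 and 64.
constexpr std::size_t NI = 3, NJ = 4, NL = 2, NW = 5;
constexpr std::size_t NP = NL * NW; // extent of the fused index p = (l, w)

using Vec = std::vector<double>;

// Fills a buffer with small integer values so the sums below are exact.
Vec makeInput(std::size_t Size, double First) {
  Vec V(Size);
  for (std::size_t K = 0; K < Size; ++K)
    V[K] = First + static_cast<double>(K % 7);
  return V;
}

// Direct form: C[i][j] += A[i][l][w] * B[w][j][l], all arrays row-major.
Vec contractDirect(const Vec &A, const Vec &B) {
  assert(A.size() == NI * NL * NW && B.size() == NW * NJ * NL);
  Vec C(NI * NJ, 0.0);
  for (std::size_t I = 0; I < NI; ++I)
    for (std::size_t J = 0; J < NJ; ++J)
      for (std::size_t L = 0; L < NL; ++L)
        for (std::size_t W = 0; W < NW; ++W)
          C[I * NJ + J] += A[(I * NL + L) * NW + W] * B[(W * NJ + J) * NL + L];
  return C;
}

// Matrix view: Am is NI x NP, Bm is NP x NJ, and C = Am * Bm.
Vec contractAsMatMul(const Vec &A, const Vec &B) {
  assert(A.size() == NI * NL * NW && B.size() == NW * NJ * NL);
  Vec Am(NI * NP), Bm(NP * NJ), C(NI * NJ, 0.0);
  // Re-index both operands by the fused index p.
  for (std::size_t L = 0; L < NL; ++L)
    for (std::size_t W = 0; W < NW; ++W) {
      const std::size_t P = L * NW + W;
      for (std::size_t I = 0; I < NI; ++I)
        Am[I * NP + P] = A[(I * NL + L) * NW + W];
      for (std::size_t J = 0; J < NJ; ++J)
        Bm[P * NJ + J] = B[(W * NJ + J) * NL + L];
    }
  // Plain matrix-matrix multiplication on the re-indexed operands.
  for (std::size_t I = 0; I < NI; ++I)
    for (std::size_t J = 0; J < NJ; ++J)
      for (std::size_t P = 0; P < NP; ++P)
        C[I * NJ + J] += Am[I * NP + P] * Bm[P * NJ + J];
  return C;
}
```

In this particular layout A already has the shape of the NI x NP matrix, so only B is actually re-indexed; other index orders would require re-indexing A as well.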

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

gareevroman created this revision. Nov 21 2021, 6:10 AM

Herald added a reviewer: bollu. Nov 21 2021, 6:10 AM

Herald added a subscriber: asbirlea.

gareevroman requested review of this revision. Nov 21 2021, 6:10 AM

Herald added a project: Restricted Project. Nov 21 2021, 6:10 AM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B135309: Diff 388756. Nov 21 2021, 6:23 AM

Thanks for upstreaming your tensor optimization!

The naming of polly-pattern-matching-based-opts suggests that it includes all pattern-based optimizations, yet this introduces another flag polly-pattern-matching-based-tc-opts. I'd prefer polly-pattern-matching-based-opts controlling both optimizations, and then additional flags for enabling matrix-multiplication and tensor optimizations. Alternatively, rename polly-pattern-matching-based-opts to e.g. polly-matmul-opt. Also, you are adding this functionality into a file called MatmulOptimizer and a function called tryOptimizeMatMulPattern. Since matrix-multiplication is a tensor contraction, is tc-opt supposed to supersede matrix multiplication? In short, I would like to know what the relation between those two optimizations should be. I'd prefer to not have to maintain two optimizations if one is strictly more powerful than the other.

I'd enjoy some occasional comments and grouping of statements (empty lines) inside the functions in addition to the doxygen comments. For instance isTCOperandAcc is just a wall of code. For such property-checking functions, ideally each return should mention what property is violated here and why this property is required to be compatible with tensor optimization.

polly/lib/Transform/MatmulOptimizer.cpp
186–187	Please add more details on what the members represent.
193–195	`std::set` is a high-overhead implementation. Consider using `DenseSet` or `SmallDenseSet`. See https://www.llvm.org/docs/ProgrammersManual.html#llvm-adt-denseset-h
196–202	Is there an argument for using 30 as the small size? If not, consider using just `SmallVector<int>`.
1131	[style] No reason to make this a wide string literal, especially if just used as an assertion failed message. Applies to other occurrences as well.
1150
1156	or introduce `intFromIslSize`.
1189–1190	`SmallVectorImpl` is not specific to what the vector's small size is.
1200	Consider using `polly::rangeIslSize` for iterating over dimensions.
1227	Although already in an anon namespace, the other methods add `static` as well. I found it helps the compiler to warn if a static function is unused.
1229–1230	Do you know of `#include <llvm/ADT/SetOperations.h>`? Unfortunately, these modify one set rather than returning a new set.
1259
1269–1270	This computes whether two sets are disjoint; it should not be required to compute the intersection.
1286
1324	`getAccessesInOrder` requires `Stmt` to not be a RegionStmt. Please add a test for it.
1334	If any of the returns are executed, what causes the pattern to be rejected (it's not `return false`)?
1403–1404
1405–1407	The check should not depend on `n_basic_set`, which is fragile and depends on whether, e.g., `coalesce` was successful. Consider using something like `polly::getConstant`.
1451–1452	Consider `lexmin_pw_multi_aff`/`lexmax_pw_multi_aff`
polly/lib/Transform/ScheduleOptimizer.cpp
469–473	Instead of modifying the idea of whether a node is tilable, consider introducing another constraint-checking function, as we should have done with prevectorization as well.
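
The remark on lines 1269–1270, that a disjointness test does not need to materialize the intersection, can be sketched with standard containers. `areDisjoint` is a hypothetical helper, not an LLVM or Polly API, and `std::set<int>` is used only to keep the sketch self-contained; the review itself recommends `DenseSet` or `SmallDenseSet` for the real code.

```cpp
#include <cassert>
#include <set>

// Walks both ordered sets in lockstep and stops at the first common element,
// so no intersection set is ever built.
bool areDisjoint(const std::set<int> &A, const std::set<int> &B) {
  auto ItA = A.begin();
  auto ItB = B.begin();
  while (ItA != A.end() && ItB != B.end()) {
    if (*ItA == *ItB)
      return false; // shared element found
    if (*ItA < *ItB)
      ++ItA;
    else
      ++ItB;
  }
  return true;
}
```

For unordered containers the same early exit can be had by iterating over the smaller set and probing the larger one.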

Thank you very much for the review!

I've tried to address all comments. Additionally, I've updated the optimization for the case of filter nodes (e.g., polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm.ll). I believe that the optimization of tensor contractions is strictly more powerful than the optimization of matrix-multiplications. So, I'd suggest to replace the optimization of matrix-multiplications with it eventually.

gareevroman marked 19 inline comments as done. Dec 12 2021, 1:30 AM

gareevroman added inline comments.

polly/lib/Transform/MatmulOptimizer.cpp
1156	As far as I understand, Dimensions.size() returns a value of type size_t instead of a value of the type isl_size. So, in the new version I used the unsigned type to avoid the cast.
1269–1270	That check is redundant. Thanks.
1286	As far as I understand, we cannot do this here because of the assignment to TCI.ADimensions and TCI.BDimensions
1324	I’ve added a check to containsOnlyTCAcc. Could you clarify what the test case should look like? Should it be a region statement that contains a matrix multiplication with the right order of memory accesses?
1405–1407	I think that this check was redundant. I’ve removed it.

gareevroman marked 5 inline comments as done. Dec 12 2021, 1:31 AM

Harbormaster completed remote builds in B138850: Diff 393734. Dec 12 2021, 1:41 AM

ping

I am extremely sorry for the late review, I hope you could use the time to work on/polish the follow-up patch. Some things are a bit hard to understand independently. I promise to review in a more timely manner from now on, although I might always add that many comments.

Please add to the test cases what they are supposed to be testing and maybe give them more meaningful filenames.

The following is successfully detected as tensor contraction. Is this intended?

void foo(double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {
     for (int i = 0; i < 1024; i++)
         for (int j = 0; j < 1024; j++)
           for (int l = 0; l < 64; ++l)
              if (l != 0)
                for (int w = 0; w < 64; ++w)
                  C[i][j] += A[i][l][w] * B[w][j][l];
}

It might be if the codegen part is able to exclude the element 0. In contrast, this one is rejected:

void foo(int n, double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {
     for (int i = 0; i < 1024; i++)
         for (int j = 0; j < 1024; j++)
           for (int l = 0; l < 64; l++)
             for (int w = 0; w < 64; ++w)
                if (w != 0)
                  C[i][j] += A[i][l][w] * B[w][j][l];
}

or this:

void foo(int n, double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {
     for (int i = 0; i < 1024; i++)
         for (int j = 0; j < 1024; j++)
           for (int l = 0; l < 64; l+=2)
             for (int w = 0; w < 64; ++w)
                  C[i][j] += A[i][l][w] * B[w][j][l];
}

I do get an assertion failure with this one:

void foo(double C[64][64], double A[64][64][64], double B[64][64][64]) {
     for (int i = 0; i < 32; i++)
         for (int j = 0; j < 32; j++)
           for (int l = 0; l < 32; l++)
             for (int w = 0; w < 32; ++w)
                  C[i][j] += A[i][l][w] * B[w][j][i+3];
}

Here, i occurs as an index for A, B, and C, and the kernel is detected as TC. Is this supported?

void foo(double C[64][64], double A[64][64][64], double B[64][64][64]) {
    for (int i = 0; i < 32; i++)
    for (int j = 0; j < 32; j++)
    for (int l = 0; l < 32; l++)
    for (int w = 0; w < 32; w++)
        C[i][j] += A[i][l][w] * B[w][j][i];
}

Is it possible to make it work with -polly-reschedule=0 as well? AFAICS there is nothing that requires rescheduling and would allow us to test the detection independently. isTCPattern seems to already consider multiple bands; the hurdle at the moment seems to be that the bands are not marked as permutable.

polly/lib/Transform/MatmulOptimizer.cpp
1630–1631	should not be necessary; any permutation of the surrounding loops can be valid. Eg, for (w = 0; w < 64; ++w) for (l = 0; l < 64; ++l) for (i = 0; i < 1024; i++) for (j = 0; j < 1024; j++) C[i][j] += A[i][l][w] * B[w][j][l]; yields the same result.
1642	Prefer `Node.isa<isl::schedule_node_leaf>()` (and then the typed subclass: `Node.as<isl::schedule_node_leaf>()`)
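
The inline comment on lines 1630–1631 states that any permutation of the surrounding loops computes the same contraction. A minimal sketch of that claim follows, with hypothetical small extents and integer-valued inputs so that floating-point reassociation cannot affect the comparison; the function names and the flat row-major layout are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t NI = 4, NJ = 3, NL = 2, NW = 5; // hypothetical extents
using Vec = std::vector<double>;

// Row-major offsets for C[i][j], A[i][l][w], and B[w][j][l].
std::size_t offC(std::size_t I, std::size_t J) { return I * NJ + J; }
std::size_t offA(std::size_t I, std::size_t L, std::size_t W) {
  return (I * NL + L) * NW + W;
}
std::size_t offB(std::size_t W, std::size_t J, std::size_t L) {
  return (W * NJ + J) * NL + L;
}

// Small integer values keep every partial sum exactly representable.
Vec makeInput(std::size_t Size, double First) {
  Vec V(Size);
  for (std::size_t K = 0; K < Size; ++K)
    V[K] = First + static_cast<double>(K % 5);
  return V;
}

// Loop order i, j, l, w, as in the examples of this review.
Vec loopOrderIJLW(const Vec &A, const Vec &B) {
  assert(A.size() == NI * NL * NW && B.size() == NW * NJ * NL);
  Vec C(NI * NJ, 0.0);
  for (std::size_t I = 0; I < NI; ++I)
    for (std::size_t J = 0; J < NJ; ++J)
      for (std::size_t L = 0; L < NL; ++L)
        for (std::size_t W = 0; W < NW; ++W)
          C[offC(I, J)] += A[offA(I, L, W)] * B[offB(W, J, L)];
  return C;
}

// Loop order w, l, i, j, as in the reviewer's counterexample.
Vec loopOrderWLIJ(const Vec &A, const Vec &B) {
  assert(A.size() == NI * NL * NW && B.size() == NW * NJ * NL);
  Vec C(NI * NJ, 0.0);
  for (std::size_t W = 0; W < NW; ++W)
    for (std::size_t L = 0; L < NL; ++L)
      for (std::size_t I = 0; I < NI; ++I)
        for (std::size_t J = 0; J < NJ; ++J)
          C[offC(I, J)] += A[offA(I, L, W)] * B[offB(W, J, L)];
  return C;
}
```

With general floating-point inputs the two orders may differ in the last bits because the additions are reassociated, which is the usual caveat for reduction reordering.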

Meinersbur added inline comments. Mar 28 2022, 6:37 PM

polly/lib/Transform/MatmulOptimizer.cpp
189–192	Please use doxygen comments to describe class/struct members. `@{ @}` can be used to group them. Also add an empty line before a new comment begins.
1156	`rangeIslSize` should make it easier.
1176	I like the idea of verifying the correctness by reconstructing and comparing to the original. Maybe do it at the end to verify that the entire `TCInfoTy` is correct? On the other hand, an earlier fail would be better. What do you think?
1201–1204	This is a weird way to find out which indices map to what other index, I guess the equivalent of `isMatMulOperandAcc`. It requires that the dimension number is part of the AccMap's range, and if there is any expression it will fail (eg `Stmt[i][j] -> A[i-1][j+1]` matches the wrong dimensions), but at least there is the verification afterwards. I am not sure I like this sort of cleverness; I'd rather expect some sort of introspection into the map's coefficients, but I also think this should work in nearly all relevant cases and should be safe due to the verification, so let's keep it. However, please document this better, eg. add an example on what is expected to happen.
1211–1212	The "plain" functions are unfortunately not very robust, eg their result is different depending on the internal representation. I'd suggest `getConstant` (from ISLTools), but it only takes a pw_aff. Could you extract uses of `plain_get_val_if_fixed` into such a function, and mark it as TODO to cope with it later?
1259	I can use JScopImport to set a scalar memory access to a partial write without adding additional dependencies; that is, I don't think this can be just ignored. I suggest to have a single function that calls `getAccessesInOrder` and sort out which MemAccess is read/write in there, then analyze them.
1290–1292	Introduce a `is_superset` (etc) call?
1310–1312	Could we add utility functions such that this becomes `unite(J, P) == IndexSet`?
1324	Test in `containsOnlyTCAcc` is exactly what I was looking for. A region statement could look like this: c = C[i][j]; if (/non-affine condition/) { (void)A[i][k] + B[k][j]; } else { C[i][j] = c; } which has the correct order of accesses but is obviously not what we are looking for.
1414
1435	This seems to determine whether there is a reduction (contraction) carried by loop number `Pos`. The function name could be more meaningful. Suggestion: `isDepCarryingReductionOverDim` (not nice, but "TcDep" can mean anything)
1436	Consider passing by const-reference.
1439	`plain_get_val_if_fixed` is not really robust as it depends on the internal representation that can be different after eg simplify. Since this just checks for a specific value, the best would be to create a new set where all the fixed dimensions are that value (here: 1), and check whether `DepDelta` is a subset of it.
1450–1452	Why not `BoundDeltas.subtract()` instead of deltas?
1491–1494	Should we also check whether WAW, RAW dependences are incompatible?
1500–1501	lexmin/lexmax can be expensive. Wrap into a `IslMaxOperationsGuard`?
1502	You seem to assume a functional relationship from here on. If that's the case, you can keep the type a `pw_multi_aff`, which supports more functions that you may have missed, such as `pw_multi_aff::add`
1504	Consider using `reverse(rangeIslSize(0,DeltasDimNum))` (from ISLTools.h).
1505–1510	This is going to check whether each element out of `Intersection` is a contraction over dimension `i`. Don't we also need to check that every iteration out of the band `i` is contributing to that contraction?
1614	This is not part of the pattern detection, but the optimization. Could we move it to the patch that does the actual optimization?
1626–1627	Could you describe here what those 4 accesses are?
1632	Could you add a high-level description how the algorithm actually works? I.e. dependencies used to determine contraction dimensions, etc.
1647–1650	This condition is effectively identical to the next
1653	This constraint should not be intrinsic to the algorithm, but I agree it to be easier to handle for now.
1656–1657	[style] No Almost-Always-Auto in LLVM's coding style.
1662	This looks for the outermost node that is not a filter or band. Is it possible that while that outermost node is not a TC contraction, one of the inner ones might? What if the outermost node is a filter, looks like it would just `return false` in this case.
polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm.ll
13
polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm_2.ll
2–4	Since this is not FileCheck-ing the LLVM-IR output, suppress it with `-disable-output`

Herald added a project: Restricted Project. Mar 28 2022, 6:37 PM

Thank you very much for the review! I am sorry for the late response. I will try to address all your comments within the next few weeks.

In D114336#3470843, @gareevroman wrote:

Thank you very much for the review! I am sorry for the late response. I will try to address all your comments within the next few weeks.

No worries, I wasn't responsive either :-(

gareevroman updated this revision to Diff 428082. May 9 2022, 7:44 AM

Herald added subscribers: ormris, steven_wu, hiraditya. May 9 2022, 7:44 AM

Harbormaster completed remote builds in B163485: Diff 428082. May 9 2022, 7:55 AM

The following is successfully detected as tensor contraction. Is this intended?

void foo(double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {

for (int i = 0; i < 1024; i++)
    for (int j = 0; j < 1024; j++)
      for (int l = 0; l < 64; ++l)
         if (l != 0)
           for (int w = 0; w < 64; ++w)
             C[i][j] += A[i][l][w] * B[w][j][l];

}

Yes, it was intended. The transformation helps to optimize a class of programs that is broader than tensor contractions. However, it heavily depends on the codegen part. I think that improving the detection can be a goal of future work.

It might be if the codegen part is able to exclude the element 0. In contrast, this one is rejected:

In this case, the codegen excludes the element 0 for i2. I added a test case for this.

domain: "{ Stmt4[i0, i1, i2, i3] : 0 <= i0 <= 1023 and 0 <= i1 <= 1023 and 0 < i2 <= 63 and 0 <= i3 <= 63 }"
…

In contrast, this one is rejected:

void foo(int n, double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {

for (int i = 0; i < 1024; i++)
    for (int j = 0; j < 1024; j++)
      for (int l = 0; l < 64; l++)
        for (int w = 0; w < 64; ++w)
           if (w != 0)
             C[i][j] += A[i][l][w] * B[w][j][l];

}

or this:

void foo(int n, double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {
for (int i = 0; i < 1024; i++)
    for (int j = 0; j < 1024; j++)
      for (int l = 0; l < 64; l+=2)
        for (int w = 0; w < 64; ++w)
             C[i][j] += A[i][l][w] * B[w][j][l];
}

As far as I know, in these cases, the codegen modifies some memory accesses. Consequently, they do not correspond to the current pattern.

ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef0[i0, i2, 1 + i3] };
ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef1[1 + i3, i1, i2] };
ReadAccess :=	[Reduction Type: +] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef2[i0, i1] };
MustWriteAccess :=	[Reduction Type: +] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef2[i0, i1] };

I do get an assertion failure with this one:

void foo(double C[64][64], double A[64][64][64], double B[64][64][64]) {

for (int i = 0; i < 32; i++)
    for (int j = 0; j < 32; j++)
      for (int l = 0; l < 32; l++)
        for (int w = 0; w < 32; ++w)
             C[i][j] += A[i][l][w] * B[w][j][i+3];

}

I fixed the isTCOperandAcc function and checked that all other asserts are used properly.

Here, i occurs as indices for A, B, and C and detected as TC. Is this supported?

void foo(double C[64][64], double A[64][64][64], double B[64][64][64]) {
for (int i = 0; i < 32; i++)
for (int j = 0; j < 32; j++)
for (int l = 0; l < 32; l++)
for (int w = 0; w < 32; w++)
    C[i][j] += A[i][l][w] * B[w][j][i];
}

For some reason, I cannot reproduce that. I have added a corresponding test case. As far as I understand, that should be detected because of the line 1365.

1347 static bool containsOnlyTCAcc(isl::set Domain, isl::map PartialSchedule,
                                   TCInfoTy &TCI) {
...
1365   if (intersect(IandJIndexSet, TCI.P).size() != 0)
1366     return false;

I think that it is redundant to require that bands are marked as permutable, since we check the form of dependencies and memory accesses. I propose to remove such checks for pattern matching optimizations.

polly/lib/Transform/MatmulOptimizer.cpp
1156	I think rangeIslSize can’t be used in this case. However, I’ve tried to use rangeIslSize to improve the patch.
1176	Other parts of TCInfoTy are verified in isTCOperandAcc too. I think that it would be better to verify the related information in one place as much and as early as possible. Probably, the earlier fail would simplify the debugging, since we exactly know the form of memory accesses and can rely on it. Additionally, the performance can be improved, since the earlier fail helps to avoid additional operations with sets.
1259	I am not sure whether modifications of implementations of tensor contractions, which contain read and write scalar memory accesses, are useful in practice. Moreover, since bundles of induction variables I, J, P can contain an unlimited number of dimensions, we possibly cannot follow the algorithm from the containsOnlyMatrMultAcc function, which permutes dimensions and checks that additional memory accesses have stride 0 in terms of dimensions MMM.i, MMM.j, and MMM.k. Consequently, such memory accesses can be treated as scalar memory accesses. I have not come up with an effective alternative yet. That is why I do not consider scalar memory accesses in getWriteAccess and setReadAccesses functions. Could we mark it as TODO and do it in the future?
1324	Thanks for the example! I have added a corresponding test case. If I am not mistaken, it requires DeLICM.
1435	Could we name it isReductionCarriedOverDim? I think, in this case, we should rename the parameter Pos to Dim to make it more readable.
1450–1452	As far as I understand, these operations are not equal. deltas computes a set containing the differences between image elements and corresponding domain elements in the input. subtract computes a subtraction of sets. For example, for the following two pairs of sets they compute the following:
	BoundDeltas: { Stmt_for_body15[31, 31, 31, 31, 31, 31] }
	isl::manage(isl_set_neg(DepDelta.copy())): { Stmt_for_body15[0, 0, 0, 0, 0, -1] }
	BoundDeltas.subtract(isl::manage(isl_set_neg(DepDelta.copy()))): { Stmt_for_body15[31, 31, 31, 31, 31, 31] }
	deltas: { Stmt_for_body15[31, 31, 31, 31, 31, 32] }
	BoundDeltas: { Stmt_for_body15[31, 31, 31, 31, 31, 31] }
	isl::manage(isl_set_neg(DepDelta.copy())): { Stmt_for_body15[0, 0, 0, -1, 0, 31] }
	BoundDeltas.subtract(isl::manage(isl_set_neg(DepDelta.copy()))): { Stmt_for_body15[31, 31, 31, 31, 31, 31] }
	deltas: { Stmt_for_body15[31, 31, 31, 32, 31, 0] }
	This comment interacts with the comment about pw_multi_aff. Consequently, I replaced the usage of isl_map_deltas with operations on pw_multi_aff.
1491–1494	As far as I understand, that is not necessary, because subsequently we check that the statement has the form C(shuffle(I, J)) = E(A(shuffle(I, P)), B(shuffle(P, J)), C(shuffle(I, J))), where E is an expression that contains reads from the tensors A, B, C, and an arbitrary number of reads from constants with respect to bundles I, J, and P. I have added a comment that describes this: "The form of anti and output dependencies is specified by the form of the SCoP statement, which is checked by subsequent analysis."
1500–1501	What is the maximal amount of computational steps we should use by default? I set it to 500000 according to DependenceInfo.cpp.
1505–1510	Could you clarify what you mean by the band i? Are these the indexes ki, which describe the dependencies? isTcDep checks that the dependency has the form S(..., ki, max(k(i + 1)), ..., max(kn), ...) -> S(..., ki + 1, min(k(i + 1)), ..., min(kn), …)
1642	Could we factor out this condition into ScheduleTreeOptimizer::isPMOptimizableBandNode, since it is common for isTCPattern and isMatrMultPattern functions? A new version of the patch shows what it could look like.
1653	Could we add a TODO comment for this?
1662	If I am not mistaken, this only checks that all band nodes, which represent the statement, are not split by filter nodes. This accepts a straightforward implementation of TC with/without delicm. For example,
	domain: "{ Stmt_for_body8[i0, i1, i2] : 0 <= i0 <= 1599 and 0 <= i1 <= 1799 and 0 <= i2 <= 2199; Stmt_for_body3[i0, i1] : 0 <= i0 <= 1599 and 0 <= i1 <= 1799; Stmt_for_body3_last[i0, i1] : 0 <= i0 <= 1599 and 0 <= i1 <= 1799 }"
	child:
	  sequence:
	  - filter: "{ Stmt_for_body3[i0, i1] }"
	    child:
	      schedule: "[{ Stmt_for_body3[i0, i1] -> [(i0)] }, { Stmt_for_body3[i0, i1] -> [(i1)] }]"
	      permutable: 1
	      coincident: [ 1, 1 ]
	  - filter: "{ Stmt_for_body3_last[i0, i1] }"
	    child:
	      schedule: "[{ Stmt_for_body3_last[i0, i1] -> [(i0)] }, { Stmt_for_body3_last[i0, i1] -> [(i1)] }]"
	      permutable: 1
	      coincident: [ 1, 1 ]
	  - filter: "{ Stmt_for_body8[i0, i1, i2] }"
	    child:
	      schedule: "[{ Stmt_for_body8[i0, i1, i2] -> [(i0)] }, { Stmt_for_body8[i0, i1, i2] -> [(i1)] }, { Stmt_for_body8[i0, i1, i2] -> [(i2)] }]"
	      permutable: 1
	      coincident: [ 1, 1, 0 ]
	and
	domain: "{ Stmt2[i0, i1, i2] : 0 <= i0 <= 31 and 0 <= i1 <= 31 and 0 <= i2 <= 31 }"
	child:
	  schedule: "[{ Stmt2[i0, i1, i2] -> [(i0)] }, { Stmt2[i0, i1, i2] -> [(i1)] }, { Stmt2[i0, i1, i2] -> [(i2)] }]"
	  permutable: 1
	  coincident: [ 1, 1, 0 ]
	Sorry, I have not committed an updated version of the optimization of TC to my github repo. However, I believe that, if this is the case, we can safely replace all such nodes.
	+  auto NodeType = isl_schedule_node_get_type(Node.get());
	+  while ((NodeType != isl_schedule_node_domain) &&
	+         (NodeType != isl_schedule_node_filter)) {
	+    assert((NodeType != isl_schedule_node_sequence) &&
	+           L"Prevent the undefined behavior");
	+    Node = Node.parent();
	+    NodeType = isl_schedule_node_get_type(Node.get());
	+  }
	+  Node = Node.child(0);
	+  Node = isl::manage(isl_schedule_node_cut(Node.release()));
	+  return Node.insert_partial_schedule(Dimensions);
	I think that the detection of more sophisticated implementations of TC is a possible goal of future research. I have described this in the comment.
polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm_2.ll
2–4	Could we fix the existing test cases in a separate patch?
	polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm_2.ll
	; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
	; REQUIRES: asserts
	polly/test/ScheduleOptimizer/pattern-matching-based-opts_16.ll
	; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
	; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
	polly/test/ScheduleOptimizer/pattern-matching-based-opts_17.ll
	; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
	; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
	polly/test/ScheduleOptimizer/pattern-matching-based-opts_18.ll
	; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
	; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
	polly/test/ScheduleOptimizer/pattern-matching-based-opts_19.ll
	; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
	; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
	polly/test/ScheduleOptimizer/pattern-matching-based-opts_20.ll
	; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
	; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
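
The difference between `deltas` and `subtract` discussed for lines 1450–1452 can be reproduced on plain integer vectors. The sketch below models a set as a `std::set` of points and a map as a list of (domain, image) pairs. `setSubtract` and `deltas` are hypothetical stand-ins for the isl operations, and pairing each negated dependence delta with the bound vector is an assumption inferred from the numbers quoted in that comment.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

using Point = std::vector<int>;
using PointSet = std::set<Point>;
using PointMap = std::vector<std::pair<Point, Point>>; // (domain, image) pairs

// Set difference: the points of A that do not occur in B.
PointSet setSubtract(const PointSet &A, const PointSet &B) {
  PointSet Result;
  for (const Point &P : A)
    if (B.count(P) == 0)
      Result.insert(P);
  return Result;
}

// Deltas of a relation: image element minus domain element, per pair.
PointSet deltas(const PointMap &Map) {
  PointSet Result;
  for (const auto &Pair : Map) {
    const Point &From = Pair.first;
    const Point &To = Pair.second;
    assert(From.size() == To.size() && "domain and image must agree in rank");
    Point Diff(From.size());
    for (std::size_t K = 0; K < From.size(); ++K)
      Diff[K] = To[K] - From[K];
    Result.insert(Diff);
  }
  return Result;
}
```

Applied to the first pair of sets from the comment, `setSubtract` leaves the bound vector unchanged because the negated delta is simply not a member of the set, whereas `deltas` yields the element-wise difference ending in 32.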

ormris removed a subscriber: ormris. May 16 2022, 10:58 AM

In D114336#3500858, @gareevroman wrote:

1
Yes, it was intended. The transformation helps to optimize a class of programs that is broader than tensor contractions. However, it heavily depends on the codegen part. I think that improving the detection can be a goal of future work.

Please document what pattern is intended to be recognized. I don't think the doc for isTCPattern is sufficient; it only mentions what is checked. Documenting the intended pattern would help to identify whether a check has been forgotten, e.g. for the statement domain.

As far as I know, in these cases, the codegen modifies some memory accesses. Consequently, they do not correspond to the current pattern.

ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef0[i0, i2, 1 + i3] };
ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef1[1 + i3, i1, i2] };
ReadAccess :=	[Reduction Type: +] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef2[i0, i1] };
MustWriteAccess :=	[Reduction Type: +] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef2[i0, i1] };

What do you mean by "codegen modifies some memory accesses"? Polly's Codegen? What is the check to exclude this? Is it the 1 + i3 memory access expression? Where does it come from?

4

Here, i occurs as indices for A, B, and C and detected as TC. Is this supported?
void foo(double C[64][64], double A[64][64][64], double B[64][64][64]) {
for (int i = 0; i < 32; i++)
for (int j = 0; j < 32; j++)
for (int l = 0; l < 32; l++)
for (int w = 0; w < 32; w++)
    C[i][j] += A[i][l][w] * B[w][j][i];
}
For some reason, I cannot reproduce that. I have added a corresponding test case. As far as I understand, that should be detected because of the line 1365.

Should or should not be detected?

While debugging, it is now rejected because of this:

if (!TCI.B) {
  // IndexSet should be a union of J and P sets.
  if (unite(TCI.P, TCI.J) != IndexSet)
    return false;
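
The rejection above follows from the operand form used elsewhere in this review: C(shuffle(I, J)), A(shuffle(I, P)), B(shuffle(P, J)). A simplified sketch of that index-set check is shown here, using loop names instead of isl dimensions. `classify`, `TCIndexSets`, and the local `unite` and `difference` helpers are hypothetical names, and taking P to be the loop indices absent from C's subscript is an assumption made for illustration; the patch derives the contracted dimensions from dependences.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>

using IndexSet = std::set<std::string>;

struct TCIndexSets {
  IndexSet I, J, P;
};

IndexSet unite(const IndexSet &X, const IndexSet &Y) {
  IndexSet Result = X;
  Result.insert(Y.begin(), Y.end());
  return Result;
}

IndexSet difference(const IndexSet &X, const IndexSet &Y) {
  IndexSet Result;
  std::set_difference(X.begin(), X.end(), Y.begin(), Y.end(),
                      std::inserter(Result, Result.end()));
  return Result;
}

// Returns true and fills Out if the subscripts fit C(I, J), A(I, P), B(P, J).
bool classify(const IndexSet &Loops, const IndexSet &SubC,
              const IndexSet &SubA, const IndexSet &SubB, TCIndexSets &Out) {
  assert(difference(SubC, Loops).empty() && "C must use loop indices only");
  TCIndexSets T;
  T.P = difference(Loops, SubC); // contracted indices (assumption, see above)
  T.I = difference(SubA, T.P);   // free indices of A
  T.J = difference(SubC, T.I);   // remaining free indices belong to B
  // A has to be indexed by exactly I and P.
  if (T.P.empty() || unite(T.I, T.P) != SubA)
    return false;
  // B has to be indexed by exactly P and J (the check quoted in the review).
  if (unite(T.P, T.J) != SubB)
    return false;
  Out = T;
  return true;
}
```

For C[i][j] += A[i][l][w] * B[w][j][l] this gives I = {i}, J = {j}, P = {l, w}; replacing the last subscript of B with i makes the index set of B differ from the union of P and J, so the kernel is rejected.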

Could you choose more meaningful identifiers than I, J, and P, or use them in the pattern described in isTCPattern? I think of something like:

for (...) {
  ...
  for (...) {
    if (c) {
      auto acc = C[P]; // ReadFromC
      auto a = A[Pa, I];
      auto b = B[Pb, J];
      auto arg = f(a, b);
      acc = acc op arg;
      C[P] = acc; // WriteToC
    }
  }
}

where P, I, J are sets of indices of the surrounding loops, Pa and Pb are subsets of P, I are the indices only occurring in the subscript for reading from A, and J are the indices only occurring in the subscript for reading from B. There must be no indices not occurring in either P, I or J. `op=` is a commutative operation ...., c is an affine condition, usually just `true`. `f` is a side-effect free operation.

(I don't know whether this is correct; I want to understand whether the checked conditions are sufficient.)

5

I think that that it is redundant to require that bands are marked as permutable, since we check the form of dependencies and memory accesses. I propose to remove such checks for pattern matching optimizations.

Ok.

polly/lib/Transform/MatmulOptimizer.cpp
210	`@{` is not needed when documenting just a single member.
1286	The assignments should just make a copy of the array. With `Dimensions` being passed by-value, the caller has to make the copy which it should not need to. `SmallVector` has no overload for being assigned an `ArrayRef`, but you could use `llvm::append_range` to insert all the values.
1331	Compiler warning:
	/home/meinersbur/src/llvm-project/polly/lib/Transform/MatmulOptimizer.cpp:1310:13: warning: moving a temporary object prevents copy elision [-Wpessimizing-move]
	TCI.I = std::move(set_difference(IndexSet, TCI.P));
	The result of `set_difference` is already an r-value, no need to cast it to an r-value.
1336	Same compiler warning.
1435	Sounds ok.
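
The `-Wpessimizing-move` warning quoted for lines 1331 and 1336 can be illustrated with a small stand-alone sketch. `Tracked`, `makeTracked`, and the two counting functions are hypothetical names; the sketch uses initialization rather than the assignment from the patch, because that is where the cost of the redundant `std::move` becomes countable. Compilers that implement the warning are expected to flag the second function.

```cpp
#include <cassert>
#include <utility>

// Counts how often an object is move-constructed or move-assigned.
struct Tracked {
  static int Moves;
  Tracked() = default;
  Tracked(const Tracked &) = default;
  Tracked(Tracked &&) noexcept { ++Moves; }
  Tracked &operator=(const Tracked &) = default;
  Tracked &operator=(Tracked &&) noexcept {
    ++Moves;
    return *this;
  }
};
int Tracked::Moves = 0;

// Returns a temporary, like set_difference in the quoted warning.
Tracked makeTracked() { return Tracked(); }

// Initializing directly from the temporary lets the compiler elide the move.
int movesWhenInitializedDirectly() {
  Tracked::Moves = 0;
  Tracked T = makeTracked();
  (void)T;
  return Tracked::Moves;
}

// Wrapping the temporary in std::move forces a real move construction.
int movesWhenWrappedInStdMove() {
  Tracked::Moves = 0;
  Tracked T = std::move(makeTracked()); // this is what the warning flags
  (void)T;
  return Tracked::Moves;
}
```

In the patch the expression is an assignment, so both spellings end up calling the move assignment operator; the cast is then merely redundant, which is why dropping `std::move` is the suggested fix.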

gareevroman updated this revision to Diff 436206. Jun 12 2022, 3:58 AM

In D114336#3517540, @Meinersbur wrote:

In D114336#3500858, @gareevroman wrote:

1
Yes, it was intended. The transformation helps to optimize a class of programs that is broader than tensor contractions. However, it heavily depends on the codegen part. I think that improving the detection can be a goal of future work.

Please document what pattern is intended to be recognized. I don't think the doc for isTCPattern is sufficient, it only mentioned what is checked. Documenting the intended pattern would help identifying if a check has been forgotten. E.g. for the statement domain.

I've added a description of the TC-like kernel, which is the intended pattern, to the doc for isTCPattern function. I've added additional remarks according to your comments and restrictions of the current implementation.

As far as I know, in these cases, the codegen modifies some memory accesses. Consequently, they do not correspond to the current pattern.
ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef0[i0, i2, 1 + i3] };
ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef1[1 + i3, i1, i2] };
ReadAccess :=	[Reduction Type: +] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef2[i0, i1] };
MustWriteAccess :=	[Reduction Type: +] [Scalar: 0]
    { Stmt3[i0, i1, i2, i3] -> MemRef2[i0, i1] };
What do you mean by "codegen modifies some memory accesses"? Polly's Codegen?

Sorry, I meant ScopBuilder. In the following case

void foo(double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {
for (int i = 0; i < 1024; i++)
    for (int j = 0; j < 1024; j++)
      for (int l = 0; l < 64; ++l)
         if (l != 0)
           for (int w = 0; w < 64; ++w)
             C[i][j] += A[i][l][w] * B[w][j][l];
}

ScopBuilder generates the following memory accesses, which correspond to the pattern:

{ Stmt4[i0, i1, i2, i3] -> MemRef0[o0, o1] : o0 = i0 and o1 = i1 }
{ Stmt4[i0, i1, i2, i3] -> MemRef3[o0, o1, o2] : o0 = i3 and o1 = i1 and o2 = i2 }
{ Stmt4[i0, i1, i2, i3] -> MemRef2[o0, o1, o2] : o0 = i0 and o1 = i2 and o2 = i3 }
{ Stmt4[i0, i1, i2, i3] -> MemRef0[o0, o1] : o0 = i0 and o1 = i1 }

If we change that code a bit

void foo(int n, double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {
for (int i = 0; i < 1024; i++)
    for (int j = 0; j < 1024; j++)
      for (int l = 0; l < 64; l++)
        for (int w = 0; w < 64; ++w)
           if (w != 0)
             C[i][j] += A[i][l][w] * B[w][j][l];
}

ScopBuilder generates the following memory accesses:

{ Stmt3[i0, i1, i2, i3] -> MemRef2[o0, o1] : o0 = i0 and o1 = i1 }
{ Stmt3[i0, i1, i2, i3] -> MemRef2[o0, o1] : o0 = i0 and o1 = i1 }
{ Stmt3[i0, i1, i2, i3] -> MemRef1[o0, o1, o2] : o0 = 1 + i3 and o1 = i1 and o2 = i2 }
{ Stmt3[i0, i1, i2, i3] -> MemRef0[o0, o1, o2] : o0 = i0 and o1 = i2 and o2 = 1 + i3 }

In the context of the previous discussion, I meant that memory accesses are modified in comparison to the previous considered case.

What is the check to exclude this?

They will be rejected by isTCOperandAcc. If we fix the output dimensions, values of output dimensions will not form a permutation of a subset of values of input dimensions. Please see comments inside this function.
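
The rejection criterion can be illustrated on a simplified model. This is a minimal sketch, not the code of isTCOperandAcc: it assumes each subscript has the form "input dimension plus constant offset", and the names `Subscript` and `isPermutationOfInputSubset` are hypothetical.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Simplified model of one subscript of an access relation:
// output dimension = input dimension `InputDim` + `Offset`.
struct Subscript {
  unsigned InputDim;
  int Offset;
};

// Returns true if the subscripts are distinct input dimensions without any
// offset, i.e. a permutation of a subset of the input dimensions.
bool isPermutationOfInputSubset(const std::vector<Subscript> &Access,
                                unsigned NumInputDims) {
  std::set<unsigned> Seen;
  for (const Subscript &S : Access) {
    if (S.Offset != 0)
      return false; // e.g. o0 = 1 + i3
    if (S.InputDim >= NumInputDims)
      return false; // not an input dimension of the statement
    if (!Seen.insert(S.InputDim).second)
      return false; // the same input dimension is used twice
  }
  return true;
}
```

In this model, `MemRef3[i3, i1, i2]` from the first kernel is accepted, while `MemRef1[1 + i3, i1, i2]` from the second kernel is rejected because of the offset.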

Is the 1 + i3 memory access expression?

Yes, I believe so.

Where does it come from?

If we look at the domain for the i3 variable, we see that the value 0 from the domain of w-loop is excluded and the loop bounds are modified to start from 0. Memory accesses correspond to this.

domain: "{ Stmt3[i0, i1, i2, i3] : 0 <= i0 <= 1023 and 0 <= i1 <= 1023 and 0 <= i2 <= 63 and 0 <= i3 <= 62 }"
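
The effect of this normalization can be checked with a small stand-alone sketch. It only enumerates the subscripts of the w dimension; the function names are hypothetical and nothing here calls into Polly.

```cpp
#include <cassert>
#include <vector>

// Subscripts touched by the original loop: w runs over [0, 63] and the
// guard `w != 0` skips the first iteration.
std::vector<int> originalSubscripts() {
  std::vector<int> Result;
  for (int w = 0; w < 64; ++w)
    if (w != 0)
      Result.push_back(w);
  return Result;
}

// Subscripts touched after normalization: the domain is 0 <= i3 <= 62 and
// the access uses 1 + i3, as in the ScopBuilder output above.
std::vector<int> normalizedSubscripts() {
  std::vector<int> Result;
  for (int i3 = 0; i3 <= 62; ++i3)
    Result.push_back(1 + i3);
  return Result;
}
```

Both functions produce the same sequence 1, ..., 63, which is why the offset moves from the loop bounds into the access relation.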

4

Here, i occurs as indices for A, B, and C and detected as TC. Is this supported?
void foo(double C[64][64], double A[64][64][64], double B[64][64][64]) {
for (int i = 0; i < 32; i++)
for (int j = 0; j < 32; j++)
for (int l = 0; l < 32; l++)
for (int w = 0; w < 32; w++)
    C[i][j] += A[i][l][w] * B[w][j][i];
}
For some reason, I cannot reproduce that. I have added a corresponding test case. As far as I understand, that should be detected because of the line 1365.
Should or should not be detected?

That should not be detected, because the intersection of free and contracted indices should always be empty. We check this at the line "if (intersect(IandJIndexSet, TCI.P).size() != 0)".

While debugging, it is now rejected because of this:
if (!TCI.B) {
  // IndexSet should be a union of J and P sets.
  if (unite(TCI.P, TCI.J) != IndexSet)
    return false;

You’re right. Thank you. "if (intersect(IandJIndexSet, TCI.P).size() != 0)" doesn’t help in this case. In that example, we have dependencies of the form:

{ Stmt3[i0, i1, i2, i3] -> Stmt3[o0, o1, o2, o3] : (o0 = i0 and o1 = i1 and o2 = i2 and o3 = 1 + i3 and i0 >= 0 and i0 <= 31 and i1 >= 0 and i1 <= 31 and i2 >= 0 and i2 <= 31 and i3 >= 0 and i3 <= 30) or (i3 = 31 and o0 = i0 and o1 = i1 and o2 = 1 + i2 and o3 = 0 and i0 >= 0 and i0 <= 31 and i1 >= 0 and i1 <= 31 and i2 >= 0 and i2 <= 30) }

The schedule tree and the isl AST have the form:

{ domain: "{ Stmt3[i0, i1, i2, i3] : 0 <= i0 <= 31 and 0 <= i1 <= 31 and 0 <= i2 <= 31 and 0 <= i3 <= 31 }", child: { mark: "Loop with Metadata", child: { schedule: "[{ Stmt3[i0, i1, i2, i3] -> [(i0)] }]", child: { mark: "Loop with Metadata", child: { schedule: "[{ Stmt3[i0, i1, i2, i3] -> [(i1)] }]", child: { mark: "Loop with Metadata", child: { schedule: "[{ Stmt3[i0, i1, i2, i3] -> [(i2)] }]", child: { mark: "Loop with Metadata", child: { schedule: "[{ Stmt3[i0, i1, i2, i3] -> [(i3)] }]" } } } } } } } } }
if (1 && (&MemRef5[31][31][32] <= &MemRef0[0][0] || &MemRef0[31][32] <= &MemRef5[0][0][0]) && (&MemRef4[31][31][32] <= &MemRef0[0][0] || &MemRef0[31][32] <= &MemRef4[0][0][0]))

    // Loop with Metadata
    for (int c0 = 0; c0 <= 31; c0 += 1) {
      // Loop with Metadata
      for (int c1 = 0; c1 <= 31; c1 += 1) {
        // Loop with Metadata
        for (int c2 = 0; c2 <= 31; c2 += 1) {
          // Loop with Metadata
          for (int c3 = 0; c3 <= 31; c3 += 1)
            Stmt3(c0, c1, c2, c3);
        }
      }
    }

else
    {  /* original code */ }

Consequently, only "l" and "w" are treated as "contracted indices", which are stored in TCI.P. Sorry, I missed that.

If indexes of an operand of the tensor contraction don’t contain TCI.P, we don't accept the program. We check this in lines

…
if (!isSuperset(IndexSet, TCI.P))
      return false;
…

…
if (unite(TCI.P, TCI.J) != IndexSet)
      return false;
…

unite(TCI.P, TCI.J) still isn't equal to IndexSet in the considered case, because IndexSet doesn't contain the "l".

If we didn't have the "l-loop", unite(TCI.P, TCI.J) still would not be equal to IndexSet, which would contain i, j and w, because TCI.I would contain only i and TCI.J would contain only j.
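
The two set checks quoted above can be replayed on plain `std::set` values. This is a minimal sketch, assuming hypothetical `unite` and `isSuperset` helpers that only mimic the names used in the thread (they are not Polly's implementations), single-character index names, and the bundle values stated above (P = {l, w}, J = {j}).

```cpp
#include <algorithm>
#include <cassert>
#include <set>

using Indices = std::set<char>;

// Hypothetical std::set-based stand-ins for the helpers named in the thread.
Indices unite(const Indices &A, const Indices &B) {
  Indices Result = A;
  Result.insert(B.begin(), B.end());
  return Result;
}

bool isSuperset(const Indices &A, const Indices &B) {
  // std::includes requires sorted ranges, which std::set guarantees.
  return std::includes(A.begin(), A.end(), B.begin(), B.end());
}

// First quoted check: the operand must use every contracted index.
bool containsContractedIndices(const Indices &Operand, const Indices &P) {
  return isSuperset(Operand, P);
}

// Second quoted check: the operand's indices must be exactly P united with J.
bool isUnionOfPAndJ(const Indices &Operand, const Indices &P,
                    const Indices &J) {
  return unite(P, J) == Operand;
}
```

With these values, the operand `B[w][j][i]` fails both checks because it lacks `l` and contains `i`, while `B[w][j][l]` from the earlier kernel passes both.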

test/ScheduleOptimizer/pattern-matching-based-opts_21.ll is the corresponding test case. Additionally, I've added test/ScheduleOptimizer/pattern-matching-based-opts_25.ll.

P.S.: I had to use the -fno-unroll-loops option of clang. Otherwise, one of the loops is optimized out on my machine.

Could you choose more meaningful identifiers than I, J, P and J, or use them in the pattern described in isTCPattern? I think of something like:

I’ve added a description of bundles I, J, and P to the description of the pattern. I hope it clarifies their purpose. I propose to apply the terminology that is used in the paper [1] and its predecessors (e.g., [2]), to simplify the understanding of the code for their readers.

[1] - Gareev R., Grosser T., Kruse M. High-Performance Generalized Tensor Operations: A Compiler-Oriented Approach. ACM Transactions on Architecture and Code Optimization (TACO). 2018. Vol. 15, no. 3. P. 34:1–34:27. DOI: 10.1145/3235029.
[2] - Matthews D. High-Performance Tensor Contraction without BLAS. SIAM Journal on Scientific Computing. 2018. Vol. 40, no. 1. P. C1–C24. DOI: 10.1137/16m108968x.

for (...) {
  ...
  for (...) {
    if (c) {
      auto acc = C[P]; // ReadFromC
      auto a = A[Pa, I];
      auto b = B[Pb,J];
      auto arg = f(a,b);
      acc = acc op arg;
      C[P] = acc; // WriteToC
    }
  }
}

where P, I, J are sets of indices of the surrounding loops, Pa and Pb are subsets of P, I are the indices only occurring in the subscript of reading from A, J are the indices only occurring in the subscript for reading from B. There must be no indices not occurring in either P, I or J.

I think the definition of the TC-like corresponds to the information about sets P, I, J. Probably, it’s redundant to introduce the terminology for subsets at this point. Could we do this in the description of the optimization, if it’d be needed?

op= is a commutative operation ....,

I think this is a redundant condition. However, you can find it in the paper [1]. I believe that, if we preserve the order of loops with indexes from the bundle P during the optimization, there would not be any violation.

c is an affine condition usually just true.

Could you elaborate on that? I’m not sure that I understand where such a condition is used.

f is a side-effect free operation.

If I’m not mistaken, according to, for example, ScopDetection.cpp, only side-effect-free function calls can be located inside a Scop.

(I don't know whether this is correct, I want to understand whether the checked conditions are sufficient).

> 5
> 
> I think that it is redundant to require that bands are marked as permutable, since we check the form of dependencies and memory accesses. I propose to remove such checks for pattern matching optimizations.

Ok.

polly/lib/Transform/MatmulOptimizer.cpp
1286	I think we can use llvm::replace to avoid clearing the vector and preserve the logic.

Harbormaster completed remote builds in B169299: Diff 436206. Jun 12 2022, 4:14 AM

In D114336#3576199, @gareevroman wrote:
void foo(int n, double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {
for (int i = 0; i < 1024; i++)
    for (int j = 0; j < 1024; j++)
      for (int l = 0; l < 64; l++)
        for (int w = 0; w < 64; ++w)
           if (w != 0)
             C[i][j] += A[i][l][w] * B[w][j][l];
}
ScopBuilder generates the following memory accesses:
{ Stmt3[i0, i1, i2, i3] -> MemRef2[o0, o1] : o0 = i0 and o1 = i1 }
{ Stmt3[i0, i1, i2, i3] -> MemRef2[o0, o1] : o0 = i0 and o1 = i1 }
{ Stmt3[i0, i1, i2, i3] -> MemRef1[o0, o1, o2] : o0 = 1 + i3 and o1 = i1 and o2 = i2 }
{ Stmt3[i0, i1, i2, i3] -> MemRef0[o0, o1, o2] : o0 = i0 and o1 = i2 and o2 = 1 + i3 }
In the context of the previous discussion, I meant that memory accesses are modified in comparison to the previous considered case.

Where does it come from?

If we look at the domain for the i3 variable, we see that the value 0 from the domain of w-loop is excluded and the loop bounds are modified to start from 0. Memory accesses correspond to this.

Looks like some other optimization (maybe JumpThreading?) modifies the loop range. Ideally, the detection would be robust enough to not depend on whether the domain space has an offset.

Could you choose more meaningful identifiers than I, J, P and J, or use them in the pattern described in isTCPattern? I think of something like:

I’ve added a description of bundles I, J, and P to the description of the pattern. I hope it clarifies their purpose. I propose to apply the terminology that is used in the paper [1] and its predecessors (e.g., [2]), to simplify the understanding of the code for their readers.

In a paper the names must be shorter to fit on the page and are described close to the figure. It also should not be necessary to have access to the paper to understand the algorithm. For narrowly scoped variables such as loop counters, single letters might be ok because the definition is likely on the same screen (or for a paper: the same page), but globals should be more identifiable. Paper [1] also does not make the connection between the symbols it uses and what they correspond to in Polly's data structures.

However, I don't request such a change atm.

for (...) {
  ...
  for (...) {
    if (c) {
      auto acc = C[P]; // ReadFromC
      auto a = A[Pa, I];
      auto b = B[Pb,J];
      auto arg = f(a,b);
      acc = acc op arg;
      C[P] = acc; // WriteToC
    }
  }
}

where P, I, J are sets of indices of the surrounding loops, Pa and Pb are subsets of P, I are the indices only occurring in the subscript of reading from A, J are the indices only occurring in the subscript for reading from B. There must be no indices not occurring in either P, I or J.

I think the definition of the TC-like corresponds to the information about sets P, I, J. Probably, it’s redundant to introduce the terminology for subsets at this point. Could we do this in the description of the optimization, if it’d be needed?

It is also 'redundant' with the code itself, but it helps understanding it.

The description of I and P are "Input dimensions of the schedule space, which represent free indices of tensors." One has to know what "free indices" are, indices of what, etc.

The "definition" of isTCPattern is declarative, does not even explain what all those symbols are, and refers to our matmul paper which does not apply because it only is a special case of the TC pattern.

op= is a commutative operation ....,

I think this is a redundant condition. However, you can find it in the paper [1]. I believe that, if we preserve the order of loops with indexes from the bundle P during the optimization, there would not be any violation.

This is exactly the sort of thing I would want to clarify.

c is an affine condition usually just true.

Could you elaborate on that? I’m not sure that I understand where such a condition is used.

It corresponds to a filter node in the TC body. You have used the if (w != 0) to illustrate where the access function deviates from the usual pattern, which implied that it would be something you would like to support. If not, that would not be part of the pattern.

f is a side-effect free operation.

If I’m not mistaken, according to, for example, ScopDetection.cpp, only side-effect-free function calls can be located inside a Scop.

f is not necessarily a function call, but as mentioned an "operation" representing the calculation done in the TC body.

Side-effect here means something different than ScopDetection. A write to an unrelated array D would be a side-effect for the TC, but accurately represented by a Scop.
We could allow unknown side-effects in polly in the future with a general "memory" dependency, an extension to what Polly already does with -polly-allow-modref-calls. These could just not be reordered relative to each other.

I really think the documentation should be better. I had a hard time fixing bugs in the matmul optimization just with understanding what the code is supposed to be doing after a long time and would prefer not to repeat that. See rGcad9f98a2ad98fecf663e9ce39502b8e43676fc9 and rGa56bd7dec8da4348d847d53c96d8a30f4a821d36.

polly/lib/Support/ISLTools.cpp
266	nice
polly/lib/Transform/MatmulOptimizer.cpp
1259	The concern is that I can modify what `isLatestArrayKind()` returns by simply importing a JScop. The `continue` just ignores such weirdness but I think it is safer to fail in this case. You yourself mention that scalar accesses are likely not useful, so why not fail when one is found instead (`return null` instead of `continue`)? Some exceptions may be possible, such as read-only scalars (`VirtualUse::ReadOnly`, `VirtualUse::Synthesizable`)
1505–1510	There is a check `Intersection.is_empty()` which is going to detect if a dependency is completely missing. But what detects that only some of the dependencies are present? Such as: [p] -> { Stmt3[i0, i1, i2, i3] -> Stmt3[o0, o1, o2, o3] : .... and p != 0 } or { Stmt3[i0, i1, i2, i3] -> Stmt3[o0, o1, o2, o3] : .... and i0 < 42 } (assuming contracting over `i=0`). `isReductionCarriedOverDim` doesn't seem to check whether the dependency is over the complete domain either.
1653	Yes, that would be great.
1659
1662	I think some info in the comment like "all surrounding band nodes are assumed to be part of the TC and must not be interleaved by filter nodes." Since it is not checking for it, it seems to imply that all other node types are OK? (sequence, set, expansion, extension, marker). Maybe reject them too? (I think ignoring marker nodes might still be ok)
polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm_2.ll
2–4	👍

gareevroman updated this revision to Diff 448777. Jul 29 2022, 11:43 PM

In D114336#3621094, @Meinersbur wrote:
In D114336#3576199, @gareevroman wrote:
void foo(int n, double C[1024][1024], double A[1024][64][64], double B[64][1024][64]) {
for (int i = 0; i < 1024; i++)
    for (int j = 0; j < 1024; j++)
      for (int l = 0; l < 64; l++)
        for (int w = 0; w < 64; ++w)
           if (w != 0)
             C[i][j] += A[i][l][w] * B[w][j][l];
}
ScopBuilder generates the following memory accesses:
{ Stmt3[i0, i1, i2, i3] -> MemRef2[o0, o1] : o0 = i0 and o1 = i1 }
{ Stmt3[i0, i1, i2, i3] -> MemRef2[o0, o1] : o0 = i0 and o1 = i1 }
{ Stmt3[i0, i1, i2, i3] -> MemRef1[o0, o1, o2] : o0 = 1 + i3 and o1 = i1 and o2 = i2 }
{ Stmt3[i0, i1, i2, i3] -> MemRef0[o0, o1, o2] : o0 = i0 and o1 = i2 and o2 = 1 + i3 }
In the context of the previous discussion, I meant that memory accesses are modified in comparison to the previous considered case.

Where does it come from?

If we look at the domain for the i3 variable, we see that the value 0 from the domain of w-loop is excluded and the loop bounds are modified to start from 0. Memory accesses correspond to this.
Looks like some other optimization (maybe JumpThreading?) modifies the loop range. Ideally, the detection would be robust enough to not depend on whether the domain space has an offset.

I agree. Improving the detection is a possible goal of future work.

Could you choose more meaningful identifiers than I, J, P and J, or use them in the pattern described in isTCPattern? I think of something like:

I’ve added a description of bundles I, J, and P to the description of the pattern. I hope it clarifies their purpose. I propose to apply the terminology that is used in the paper [1] and its predecessors (e.g., [2]), to simplify the understanding of the code for their readers.

In a paper the names must be shorter to fit on the page and are described close to the figure. It also should not be necessary to have access to the paper to understand the algorithm. For narrowly scoped variables such as loop counters, single letters might be ok because the definition is likely on the same screen (or for a paper: the same page), but globals should be more identifiable. Paper [1] also does not make the connection between the symbols it uses and what they correspond to in Polly's data structures.

However, I don't request such a change atm.

Ok. I've tried to make the description of the algorithm self-consistent.

for (...) {
  ...
  for (...) {
    if (c) {
      auto acc = C[P]; // ReadFromC
      auto a = A[Pa, I];
      auto b = B[Pb,J];
      auto arg = f(a,b);
      acc = acc op arg;
      C[P] = acc; // WriteToC
    }
  }
}

where P, I, J are sets of indices of the surrounding loops, Pa and Pb are subsets of P, I are the indices only occurring in the subscript of reading from A, J are the indices only occurring in the subscript for reading from B. There must be no indices not occurring in either P, I or J.
I think the definition of the TC-like corresponds to the information about sets P, I, J. Probably, it’s redundant to introduce the terminology for subsets at this point. Could we do this in the description of the optimization, if it’d be needed?

It is also 'redundant' with the code itself, but it helps understanding it.

The description of I and P are "Input dimensions of the schedule space, which represent free indices of tensors." One has to know what "free indices" are, indices of what, etc.

The "definition" of isTCPattern is declarative, does not even explain what all those symbols are, and refers to our matmul paper which does not apply because it only is a special case of the TC pattern.

I've tried to improve the description.

op= is a commutative operation ....,

I think this is a redundant condition. However, you can find it in the paper [1]. I believe that, if we preserve the order of loops with indexes from the bundle P during the optimization, there would not be any violation.

This is exactly the sort of thing I would want to clarify.

We don't check for associativity, because it's difficult and not necessary for the optimization. The optimization doesn't change the order of loops with indexes from the bundle P, even if you parallelize the outermost loop. Hence, it doesn't violate anything. For the same reason, we don't check for associativity in the optimization of the generalized matrix-matrix multiplication, which is currently used in Polly.
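
The argument can be illustrated with floating-point addition, which is not associative. This is a minimal sketch with hypothetical names and hand-picked inputs, not a model of Polly's transformation: interchanging the free loops leaves the order of the additions over the contracted index untouched, whereas reversing the contracted loop changes the rounded result.

```cpp
#include <array>
#include <cassert>

constexpr int N = 2; // extent of the free indices i and j
constexpr int K = 3; // extent of the contracted index p
using Mat = std::array<std::array<double, N>, N>;

// Inputs chosen so that every product is exact and only the order of the
// additions over p can change the rounded result.
const double A[N][K] = {{1e16, 1.0, 1.0}, {1.0, 1.0, 1e16}};
const double B[K][N] = {{1.0, 1.0}, {1.0, 1.0}, {1.0, 1.0}};

// Original loop order: i, j, p.
Mat contractIJP() {
  Mat C{};
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      for (int p = 0; p < K; ++p)
        C[i][j] += A[i][p] * B[p][j];
  return C;
}

// Free loops interchanged, contracted loop untouched: j, i, p.
Mat contractJIP() {
  Mat C{};
  for (int j = 0; j < N; ++j)
    for (int i = 0; i < N; ++i)
      for (int p = 0; p < K; ++p)
        C[i][j] += A[i][p] * B[p][j];
  return C;
}

// Contracted loop reversed: the additions happen in a different order.
Mat contractReversedP() {
  Mat C{};
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      for (int p = K - 1; p >= 0; --p)
        C[i][j] += A[i][p] * B[p][j];
  return C;
}
```

Assuming IEEE 754 double arithmetic, `contractIJP()` and `contractJIP()` agree exactly, while `contractReversedP()` differs because 1e16 + 1.0 rounds back to 1e16.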

c is an affine condition usually just true.

Could you elaborate on that? I’m not sure that I understand where such a condition is used.

It corresponds to a filter node in the TC body. You have used the if (w != 0) to illustrate where the access function deviates from the usual pattern, which implied that it would be something you would like to support. If not, that would not be part of the pattern.

Ok. Unfortunately, the current approach doesn't support this. So, it's not part of the pattern.

f is a side-effect free operation.

If I’m not mistaken, according to, for example, ScopDetection.cpp, only side-effect-free function calls can be located inside a Scop.

f is not necessarily a function call, but as mentioned an "operation" representing the calculation done in the TC body.

Side-effect here means something different than ScopDetection. A write to an unrelated array D would be a side-effect for the TC, but accurately represented by a Scop.
We could allow unknown side-effects in polly in the future with a general "memory" dependency, an extension to what Polly already does with -polly-allow-modref-calls. These could just not be reordered relative to each other.

I agree that only side-effect free operations are considered in the pattern. Nevertheless, I propose not to use terms that may require an additional specification. I've tried to improve the description of the pattern.

I really think the documentation should be better. I had a hard time fixing bugs in the matmul optimization just with understanding what the code is supposed to be doing after a long time and would prefer not to repeat that. See rGcad9f98a2ad98fecf663e9ce39502b8e43676fc9 and rGa56bd7dec8da4348d847d53c96d8a30f4a821d36.

Sure. Let's continue improving it.

Harbormaster completed remote builds in B178388: Diff 448777. Jul 29 2022, 11:54 PM

gareevroman marked 4 inline comments as done. Jul 29 2022, 11:56 PM

gareevroman added inline comments.

polly/lib/Transform/MatmulOptimizer.cpp
1259	I've added such a return statement to avoid scalar write memory accesses. Sorry, I was wrong. We need scalar read memory accesses. For example, in the case of the following matrix-matrix multiplication, a SCoP statement, which represents the body of the loop, contains the constant alpha: C = alpha*A*B. Could we accept non-partial scalar read memory accesses? I think this is legal.
1505–1510	Are dependencies determined by the form of memory accesses? In the isCorrectAccessMap function, we check that memory accesses aren't partial. Isn't it sufficient? I've tried to check whether the dependency is over the complete domain though.
1653	Ok. I've left that TODO comment.
1659	Looks like I missed that. Sorry. I will fix it in the next patch.
1662	> I think some info in the comment like "all surrounding band nodes are assumed to be part of the TC and must not be interleaved by filter nodes."
	I've added it.
	> Since it is not checking for it, it seems to imply that all other node types are OK? (sequence, set, expansion, extension, marker). Maybe reject them too? (I think ignoring marker nodes might still be ok)
	Sequence nodes could be necessary if DeLICM was applied. Please see the example inside isTCPattern. Yes, I think other types except for marker nodes should be rejected. Additionally, as a precaution, I propose to check that a filter node has only sequence and domain nodes as its predecessors. I've updated the patch.
polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm_2.ll
2–4	Ok. I've added this to my TODO list.
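
The scalar read discussed in the reply for line 1259 can be pictured with a small kernel. This is a minimal sketch with hypothetical names and a fixed 2x2 size; it only shows where the loop-invariant scalar appears in the statement and makes no claim about how Polly models the access.

```cpp
#include <array>
#include <cassert>

constexpr int N = 2;
using Matrix = std::array<std::array<double, N>, N>;

// C += alpha * A * B. The loop body reads the loop-invariant scalar `alpha`
// in addition to the array elements A[i][k], B[k][j] and C[i][j].
void scaledMatMul(double alpha, const Matrix &A, const Matrix &B, Matrix &C) {
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      for (int k = 0; k < N; ++k)
        C[i][j] += alpha * A[i][k] * B[k][j];
}
```

Here `alpha` is only read in every iteration, which corresponds to the non-partial scalar read that the reply proposes to accept.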

Thank you Gareev. I think the description can still be improved, but we should also move forward and can improve iteratively.

Looking forward to the actual TC optimization.

polly/lib/Transform/MatmulOptimizer.cpp
199	AFAIU multiplication by β is not part of this detection, but required to be loop-distributed by the isl scheduler.
1181
1324	It does not require DeLICM, but `-polly-allow-nonaffine-branches` (which is enabled by default)
1654	[typo]
1735	What is Goto here? GotoBLAS?

This revision is now accepted and ready to land. Aug 2 2022, 11:55 AM

Closed by commit rGb02c7e2b630a: [Polly] Generalize the pattern matching to the case of tensor contractions (authored by gareevroman). Aug 7 2022, 4:22 AM

This revision was automatically updated to reflect the committed changes.

gareevroman marked 2 inline comments as done.

gareevroman added a commit: rGb02c7e2b630a: [Polly] Generalize the pattern matching to the case of tensor contractions.

In D114336#3694323, @Meinersbur wrote:

Thank you Gareev. I think the description can still be improved, but we should also move forward and can improve iteratively.

Looking forward to the actual TC optimization.

Thanks! I've tried to address new comments in the committed patch.

polly/lib/Transform/MatmulOptimizer.cpp

199

Yes, it's not. I've added a comment about this.

1324

If I'm not mistaken, in your example the form of the dependencies doesn't correspond to the pattern.

c = C[i][j];
if (/*non-affine condition*/) {
  A[i][k] + B[k][j];
} else {
  C[i][j] = c;
}

MayWrite: { Stmt_for_body8TOfor_inc[i0, i1, i2] -> MemRef_C[i0, i1] : 0 <= i0 <= 31 and 0 <= i1 <= 31 and 0 <= i2 <= 31 }

I've added a slightly modified version of it to polly/test/ScheduleOptimizer/pattern-matching-based-opts_23.ll. It produces a region statement too.

for (int i = 0; i < 32; i++)
  for (int j = 0; j < 32; j++)
    for (int k = 0; k < 32; k++) {
      int c = C[i][j];
      if (i*j*k < 10) {
        C[i][j] = A[i][k] + B[k][j];
      } else {
        C[i][j] = c;
      } 
}

However, it introduces store merge phi nodes. It makes DeLICM necessary.

Statements {
	Stmt_for_body8__TO__if_end
        Domain :=
            { Stmt_for_body8__TO__if_end[i0, i1, i2] : 0 <= i0 <= 31 and 0 <= i1 <= 31 and 0 <= i2 <= 31 };
        Schedule :=
            { Stmt_for_body8__TO__if_end[i0, i1, i2] -> [i0, i1, i2, 0] };
        ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
            { Stmt_for_body8__TO__if_end[i0, i1, i2] -> MemRef_A[i0, i2] };
        ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
            { Stmt_for_body8__TO__if_end[i0, i1, i2] -> MemRef_B[i2, i1] };
        ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
            { Stmt_for_body8__TO__if_end[i0, i1, i2] -> MemRef_C[i0, i1] };
        MustWriteAccess :=	[Reduction Type: NONE] [Scalar: 1]
            { Stmt_for_body8__TO__if_end[i0, i1, i2] -> MemRef_storemerge__phi[] };
       new: { Stmt_for_body8__TO__if_end[i0, i1, i2] -> MemRef_C[i0, i1] };
	Stmt_if_end
        Domain :=
            { Stmt_if_end[i0, i1, i2] : 0 <= i0 <= 31 and 0 <= i1 <= 31 and 0 <= i2 <= 31 };
        Schedule :=
            { Stmt_if_end[i0, i1, i2] -> [i0, i1, i2, 1] };
        ReadAccess :=	[Reduction Type: NONE] [Scalar: 1]
            { Stmt_if_end[i0, i1, i2] -> MemRef_storemerge__phi[] };
       new: { Stmt_if_end[i0, i1, i2] -> MemRef_C[i0, i1] };
        MustWriteAccess :=	[Reduction Type: NONE] [Scalar: 0]
            { Stmt_if_end[i0, i1, i2] -> MemRef_C[i0, i1] };

Meinersbur added inline comments. Aug 8 2022, 3:53 PM

polly/lib/Transform/MatmulOptimizer.cpp
1324	The `storemerge` PHI node is introduced by one of the canonicalization passes (in this case: InstCombine). It is possible to not run that pass, disable the matching of this particular pattern, or use some other trick to not be matched. In any case, we cannot rely on InstCombine to happen. It might be safer to bail out if any RegionStmt is encountered.

gareevroman marked an inline comment as done. Aug 14 2022, 3:39 AM

gareevroman added inline comments.

polly/lib/Transform/MatmulOptimizer.cpp
1324	I see. I haven't managed to fix that test case. So, I've decided to remove it. I've left the check that bails out if any RegionStmt is encountered in containsOnlyTCAcc.

gareevroman mentioned this in rGa5d981045de7: [Polly] Remove the test case that depends on InstCombine and DeLICM. Aug 14 2022, 6:07 AM

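Revision Contents

Path	Size
polly/include/polly/Support/ISLTools.h	5 lines
polly/lib/Support/ISLTools.cpp	12 lines
polly/lib/Transform/MatmulOptimizer.cpp	812 lines
polly/lib/Transform/ScheduleOptimizer.cpp	36 lines
polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm.ll	32 lines
polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm_2.ll	108 lines
polly/test/ScheduleOptimizer/pattern-matching-based-opts.ll	9 lines
polly/test/ScheduleOptimizer/pattern-matching-based-opts_11.ll	4 lines
polly/test/ScheduleOptimizer/pattern-matching-based-opts_15.ll	4 lines
polly/test/ScheduleOptimizer/pattern-matching-based-opts_16.ll	64 lines
polly/test/ScheduleOptimizer/pattern-matching-based-opts_17.ll	64 lines
polly/test/ScheduleOptimizer/pattern-matching-based-opts_18.ll	84 lines
polly/test/ScheduleOptimizer/pattern-matching-based-opts_19.ll	84 lines
polly/test/ScheduleOptimizer/pattern-matching-based-opts_2.ll	4 lines
polly/test/ScheduleOptimizer/pattern-matching-based-opts_20.ll	94 lines
polly/test/ScheduleOptimizer/pattern-matching-based-opts_21.ll	64 lines
polly/test/ScheduleOptimizer/pattern-matching-based-opts_22.ll	65 lines
polly/test/ScheduleOptimizer/pattern-matching-based-opts_23.ll	79 lines
polly/test/ScheduleOptimizer/pattern-matching-based-opts_24.ll	65 lines
polly/test/ScheduleOptimizer/pattern-matching-based-opts_25.ll	56 lines
polly/test/ScheduleOptimizer/pattern-matching-based-opts_4.ll	10 lines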

Diff 450615

polly/include/polly/Support/ISLTools.h

 /// Subtract the parameter space @p Params from @p Set.
 isl::set subtractParams(isl::set Set, isl::set Params);
 
 /// If @p PwAff maps to a constant, return said constant. If @p Max/@p Min, it
 /// can also be a piecewise constant and it would return the minimum/maximum
 /// value. Otherwise, return NaN.
 isl::val getConstant(isl::pw_aff PwAff, bool Max, bool Min);
 
+/// If the relation @p PwAff lies on a hyperplane where the given
+/// dimension @p Pos with the type @p Dim has a fixed value, then
+/// return that value. Otherwise return NaN.
+isl::val getConstant(isl::map Map, isl::dim Dim, int Pos);
+
 /// Check that @p End is valid and return an iterator from @p Begin to @p End
 ///
 /// Use case example:
 /// for (unsigned i : rangeIslSize(0, Map.domain_tuple_dim()))
 /// // do stuff
 llvm::iota_range<unsigned> rangeIslSize(unsigned Begin, isl::size End);
 
 /// Dump a description of the argument to llvm::errs().

polly/lib/Support/ISLTools.cpp

   case isl::dim::in:
     return Map.apply_domain(TranslatorMap);
   case isl::dim::out:
     return Map.apply_range(TranslatorMap);
   default:
     llvm_unreachable("Unsupported value for 'dim'");
   }
 }
 
+isl::val polly::getConstant(isl::map Map, isl::dim Dim, int Pos) {
+  unsigned NumDims = unsignedFromIslSize(Map.dim(Dim));
+  if (Pos < 0)
+    Pos = NumDims + Pos;
+  assert(unsigned(Pos) < NumDims && "Dimension index must be in range");
+  // TODO: The isl_map_plain_get_val_if_fixed function is not robust, since its
+  // result is different depending on the internal representation.
+  // Replace it with a different implementation.
+  return isl::manage(isl_map_plain_get_val_if_fixed(
+      Map.get(), static_cast<enum isl_dim_type>(Dim), Pos));
+}
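
The index handling at the top of `polly::getConstant` lets a negative `Pos` count from the end of the dimension list. A stand-alone sketch of just that step, with the hypothetical name `normalizePos` and no dependency on isl:

```cpp
#include <cassert>

// Stand-alone sketch of the index handling shown above: a negative Pos
// counts from the end, so -1 refers to the last dimension.
unsigned normalizePos(int Pos, unsigned NumDims) {
  if (Pos < 0)
    Pos = static_cast<int>(NumDims) + Pos;
  assert(static_cast<unsigned>(Pos) < NumDims &&
         "Dimension index must be in range");
  return static_cast<unsigned>(Pos);
}
```

For a map with three dimensions of the requested type, `normalizePos(-1, 3)` yields 2 and `normalizePos(0, 3)` yields 0.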

 isl::union_map polly::shiftDim(isl::union_map UMap, isl::dim Dim, int Pos,
                                int Amount) {
   isl::union_map Result = isl::union_map::empty(UMap.ctx());
 
   for (isl::map Map : UMap.get_map_list()) {
     isl::map Shifted = shiftDim(Map, Dim, Pos, Amount);
     Result = Result.unite(Shifted);
   }

polly/lib/Transform/MatmulOptimizer.cpp

//===- MatmulOptimizer.cpp -----------------------------------------------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#include "polly/MatmulOptimizer.h"
#include "polly/DependenceInfo.h"
#include "polly/Options.h"
#include "polly/ScheduleTreeTransform.h"
#include "polly/ScopInfo.h"
#include "polly/ScopPass.h"
#include "polly/Simplify.h"
+ #include "polly/Support/GICHelper.h"
#include "polly/Support/ISLTools.h"
#include "llvm/ADT/ArrayRef.h"
+ #include "llvm/ADT/DenseSet.h"
#include "llvm/ADT/Optional.h"
#include "llvm/ADT/Sequence.h"
+ #include "llvm/ADT/SetOperations.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/ADT/iterator_range.h"
#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/CommandLine.h"

[93 lines not shown; enclosing context: static cl::opt<int> VectorRegisterBitwidth(]
    cl::Hidden, cl::init(-1), cl::cat(PollyCategory));

static cl::opt<int> PollyPatternMatchingNcQuotient(
    "polly-pattern-matching-nc-quotient",
    cl::desc("Quotient that is obtained by dividing Nc, the parameter of the"
             "macro-kernel, by Nr, the parameter of the micro-kernel"),
    cl::Hidden, cl::init(256), cl::cat(PollyCategory));

+ static cl::opt<bool>
+     PMBasedTCOpts("polly-tc-opt",
+                   cl::desc("Perform optimizations of tensor contractions based "
+                            "on pattern matching"),
+                   cl::init(false), cl::ZeroOrMore, cl::cat(PollyCategory));
+
+ static cl::opt<bool>
+     PMBasedMMMOpts("polly-matmul-opt",
+                    cl::desc("Perform optimizations of matrix multiplications "
+                             "based on pattern matching"),
+                    cl::init(true), cl::ZeroOrMore, cl::cat(PollyCategory));
+
+ static cl::opt<int> OptComputeOut(
+     "polly-tc-dependences-computeout",
+     cl::desc("Bound the dependence analysis by a maximal amount of "
+              "computational steps (0 means no bound)"),
+     cl::Hidden, cl::init(500000), cl::ZeroOrMore, cl::cat(PollyCategory));

namespace {
/// Parameters of the micro kernel.
///
/// Parameters, which determine sizes of rank-1 (i.e., outer product) update
/// used in the optimized matrix multiplication.
struct MicroKernelParamsTy {
  int Mr;
  int Nr;
[18 lines not shown; enclosing context: struct MatMulInfoTy {]
  MemoryAccess *B = nullptr;
  MemoryAccess *ReadFromC = nullptr;
  MemoryAccess *WriteToC = nullptr;
  int i = -1;
  int j = -1;
  int k = -1;
};

/// Parameters of the tensor contraction operands.

///

/// A general d-dimensional tensor T ∈ R ^ Nu0 x ... x Nud−1 can be defined

/// as the set of scalar elements indexed by the set of indices u0 ... ud−1,

Meinersbur (Done): Please add more details on what the members represent.

///

/// T ≡ {Anu0...nud−1 ∈ R | (u0,...,ud−1) ∈ Nu0 x ... x Nud−1}.

///

/// Let A, B, and C be dA, dB, and dC-dimensional tensors, respectively.

/// Let the free and the contracted indices of the tensor A be grouped into

Meinersbur (Done):

struct TCInfoTy {
- // Memory accesses that represent reading from tensors, which are operands of
- // the tensor contraction.
+ /// @{
+ /// Memory accesses that represent reading from tensors, which are operands of
+ /// the tensor contraction.
MemoryAccess *A = nullptr;
MemoryAccess *B = nullptr;
+ /// @}
// Memory accesses that represent reading from and writing into the tensor,

Please use doxygen comments to describe class/struct members. `@{ @}` can be used to group them.

Also add an empty line before a new comment begins.

/// two bundles I = i0...ir−1 and P = p0...pt−1, respectively. Similarly,

/// the free and the contracted indices of B are grouped into bundles

/// J = j0..js−1 and P and the free indices of C are grouped into

Meinersbur (Done): `std::set` is a high-overhead implementation. Consider using `DenseSet` or `SmallDenseSet`. See https://www.llvm.org/docs/ProgrammersManual.html#llvm-adt-denseset-h

/// bundles I and J.

///

/// Tensor contraction (TC) of tensors A, B into tensor C can be represented as

/// C(shuffle(I,J))=∑α·A(shuffle(I,P))·B(shuffle(P,J))+β·C(shuffle(I,J)),

Meinersbur (Done): AFAIU multiplication by β is not part of this detection, but required to be loop-distributed by the isl scheduler.

gareevroman (Author, Done): Yes, it's not. I've added a comment about this.

/// where ∑ is a summation over all contracted indices of P,

/// α, β ∈ R, Npi is the length of the tensor dimension that corresponds

/// to the index pi, A(shuffle(I, P)), B(shuffle(P, J)), C(shuffle(I, J)) are

Meinersbur (Done): Is there an argument for using 30 as the small size? If not, consider using just `SmallVector<int>`.

/// accesses to tensors A, B, C, respectively,

/// shuffle(I, J), shuffle(I, P), and shuffle(P, J) are permutations of

/// the enclosed indices.

///

/// Multiplication of C(shuffle(I,J)) by β can be moved into a different SCoP

/// statement by loop distribution, which is done by the isl scheduler.

/// If β is not equal to one, the optimization of TC of Polly requires

/// such a transformation.

Meinersbur (Done): `@{` is not needed when documenting just a single member.

///

/// TCInfoTy contains parameters, which describe access relations that represent

/// operands of the tensor contraction.

struct TCInfoTy {

/// @{

/// Memory accesses that represent reading from tensors, which are operands of

/// the tensor contraction.

MemoryAccess *A = nullptr;

MemoryAccess *B = nullptr;

/// @}

/// @{

/// Memory accesses that represent reading from and writing into the tensor,

/// which contains the result of the tensor contraction.

MemoryAccess *ReadFromC = nullptr;

MemoryAccess *WriteToC = nullptr;

/// @}

/// @{

/// Input dimensions of the schedule space, which represent free

/// indices of tensors.

SmallDenseSet<int> I;

SmallDenseSet<int> J;

/// @}

/// Input dimensions of the schedule space, which represent contracted

/// indices of tensors.

SmallDenseSet<int> P;

/// @{

/// Sizes of tensor dimensions for corresponding input dimensions of

/// the schedule space. The size of the tensor dimension can be larger than

/// the size of the corresponding input dimension of the schedule space.

/// This does not correspond to a tensor contraction. However, such a pattern

/// will be optimized by the transformation.

SmallVector<int> DimensionSizes;

SmallVector<int> ADimensions;

SmallVector<int> BDimensions;

SmallVector<int> CDimensions;

/// @}

/// @{

/// Permutations of indices of I, J, and P, which describe operands of

/// the tensor contraction and its result.

SmallVector<int> OrderedI;

SmallVector<int> OrderedJ;

SmallVector<int> OrderedP;

/// @}

};
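
To make the roles of `I`, `J`, `P` and the `*Dimensions` vectors concrete, here is a minimal sketch that fills a simplified stand-in for `TCInfoTy` for one contraction. Everything in it is an assumption made for illustration: `TCInfoSketch`, `toSet`, `unionOf`, `hasContractionShape` and `makeExample` are hypothetical names, `std::set` and `std::vector` replace the LLVM containers, and the remaining members of the struct are left out.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Simplified stand-in for TCInfoTy (hypothetical, not the patch's type).
struct TCInfoSketch {
  std::set<int> I, J, P;        // free (I, J) and contracted (P) indices
  std::vector<int> ADimensions; // schedule dimension used by each subscript
  std::vector<int> BDimensions;
  std::vector<int> CDimensions;
};

std::set<int> toSet(const std::vector<int> &V) {
  return std::set<int>(V.begin(), V.end());
}

std::set<int> unionOf(std::set<int> A, const std::set<int> &B) {
  A.insert(B.begin(), B.end());
  return A;
}

// A is indexed by exactly I and P, B by exactly P and J, C by exactly I and J.
bool hasContractionShape(const TCInfoSketch &TCI) {
  return toSet(TCI.ADimensions) == unionOf(TCI.I, TCI.P) &&
         toSet(TCI.BDimensions) == unionOf(TCI.J, TCI.P) &&
         toSet(TCI.CDimensions) == unionOf(TCI.I, TCI.J);
}

// C[i0][j0][i1] += A[i1][p0][i0] * B[j0][p0], with schedule dimensions
// i0 = 0, i1 = 1, j0 = 2, p0 = 3.
TCInfoSketch makeExample() {
  TCInfoSketch TCI;
  TCI.I = {0, 1};
  TCI.J = {2};
  TCI.P = {3};
  TCI.ADimensions = {1, 3, 0};
  TCI.BDimensions = {2, 3};
  TCI.CDimensions = {0, 2, 1};
  return TCI;
}
```

Under these assumptions, each `*Dimensions` vector is the permutation written as `shuffle(...)` in the comment above, read off subscript by subscript.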

/// Create an isl::union_set, which describes the option of the form
/// [isolate[] -> unroll[x]].
///
/// @param Ctx An isl::ctx, which is used to create the isl::union_set.
static isl::union_set getUnrollIsolatedSetOptions(isl::ctx Ctx) {
  isl::space Space = isl::space(Ctx, 0, 0, 1);
  isl::map UnrollIsolatedSetOption = isl::map::universe(Space);
  isl::id DimInId = isl::id::alloc(Ctx, "isolate", nullptr);
[831 lines not shown]

///
/// @param Node The node to check.
/// @param D    The SCoP dependencies.
/// @param MMI  Parameters of the matrix multiplication operands.
static bool isMatrMultPattern(isl::schedule_node Node, const Dependences *D,
                              MatMulInfoTy &MMI) {
  auto PartialSchedule = isl::manage(
      isl_schedule_node_band_get_partial_schedule_union_map(Node.get()));
- Node = Node.child(0);
- auto LeafType = isl_schedule_node_get_type(Node.get());
- Node = Node.parent();
- if (LeafType != isl_schedule_node_leaf ||
-     isl_schedule_node_band_n_member(Node.get()) < 3 ||
+ if (isl_schedule_node_band_n_member(Node.get()) < 3 ||
      Node.get_schedule_depth().release() != 0 ||
      isl_union_map_n_map(PartialSchedule.get()) != 1)
    return false;
  auto NewPartialSchedule = isl::map::from_union_map(PartialSchedule);
  if (containsMatrMult(NewPartialSchedule, D, MMI))
    return true;
  return false;
}

/// Get the dimension size.

///

/// Return the size of the dimension @p Pos, which is obtained from @p SAI.

/// Return -1 in the case of the first dimension of a multi-dimensional array,

/// since the ScopArrayInfo class does not carry size information.

///

/// @param SAI The information about the array.

/// @param Pos The position of the dimension.

/// @return The size of the dimension.

static int getDimSize(const ScopArrayInfo *SAI, unsigned Pos) {

if (Pos == 0)

return -1;

const llvm::SCEV *SCEVDimSize = SAI->getDimensionSize(Pos);

assert(SCEVDimSize);

Meinersbur (Done):

const llvm::SCEV *SCEVDimSize = SAI->getDimensionSize(Pos);
- assert(SCEVDimSize && L"Prevent the undefined behavior");
+ assert(SCEVDimSize && "Prevent the undefined behavior");
auto *ConstantDimSize = dyn_cast<const SCEVConstant>(SCEVDimSize);

[style] No reason to make this a wide string literal, especially if just used as an assertion failed message.

Applies to other occurrences as well.

auto *ConstantDimSize = dyn_cast<const SCEVConstant>(SCEVDimSize);

assert(ConstantDimSize);

auto *IntDimSize = dyn_cast<ConstantInt>(ConstantDimSize->getValue());

assert(IntDimSize);

return IntDimSize->getSExtValue();

}
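
The `-1` convention of `getDimSize` can be shown on a small model. This is a sketch under stated assumptions: `ArrayInfoSketch` and `getDimSizeSketch` are hypothetical names that stand in for `ScopArrayInfo` and `getDimSize`, and the model stores only the sizes of the inner dimensions, which is how the doc comment above describes the available information.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of an array description: the sizes of all dimensions
// except the outermost one, e.g. {64, 32} for `double A[][64][32]`.
struct ArrayInfoSketch {
  std::vector<int> InnerSizes;
};

// Position 0 has no recorded size and yields -1; every other position
// yields its constant size.
int getDimSizeSketch(const ArrayInfoSketch &Info, unsigned Pos) {
  if (Pos == 0)
    return -1;
  assert(Pos - 1 < Info.InnerSizes.size() && "Position must be in range");
  return Info.InnerSizes[Pos - 1];
}
```

Callers therefore have to treat `-1` as "unknown" rather than as a size, which matches the `DimensionSizes[i] <= 0` test later in the patch.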

/// Check whether the access relation has the specified form.

///

/// Check that the access relation @p AccMap has the form T[I0, …, In], where

/// indexes I0, …, In are specified by @p Dimensions.

///

/// @param Domain The domain of the access relation.

/// @param AccMap The access relation to be checked.

/// @param Dimensions The permutation of the subset of the input dimensions.

/// @return True if @p AccMap has the expected form and false,

/// otherwise.

static bool isCorrectAccessMap(isl::set Domain, isl::map AccMap,

ArrayRef<int> Dimensions) {

Meinersbur (Done):

static bool isCorrectAccessMap(isl::set Domain, isl::map AccMap,
- const SmallVector<int, 30> &Dimensions) {
+ ArrayRef<int> Dimensions) {
isl::space Space = AccMap.get_space();

isl::space Space = AccMap.get_space();

if (unsignedFromIslSize(Space.dim(isl::dim::out)) != Dimensions.size())

return false;

// Create an access relation of the following form:

// [I0, …, Im] -> [Il, …, In], where indexes

Meinersbur (Done):

isl::map PossibleTensor = isl::manage(Universe.copy());
- for (int i = 0; i < static_cast<int>(Dimensions.size()); i++) {
+ for (unsigned i = 0; i < unsignedFromIslSize(Dimensions.size()); i++) {
const int InPos = Dimensions[i];

or introduce `intFromIslSize`.

gareevroman (Author, Done): As far as I understand, Dimensions.size() returns a value of type size_t rather than a value of type isl_size. So, in the new version I used the unsigned type to avoid the cast.

Meinersbur (Not Done): `rangeIslSize` should make it easier.

gareevroman (Author, Done): I think rangeIslSize can't be used in this case. However, I've tried to use rangeIslSize to improve the patch.

// Il, …, In are specified by @p Dimensions.

isl::map PossibleTensor = isl::map::universe(Space);

unsigned DimInSize = unsignedFromIslSize(Space.dim(isl::dim::in));

for (unsigned i = 0; i < Dimensions.size(); i++) {

const int InPos = Dimensions[i];

if ((InPos >= static_cast<int>(DimInSize)) || (InPos < 0))

return false;

PossibleTensor =

PossibleTensor.equate(isl::dim::in, InPos, isl::dim::out, i);

}

AccMap = AccMap.intersect_domain(Domain);

PossibleTensor = PossibleTensor.intersect_domain(Domain);

// If AccMap != PossibleTensor here (the two maps have been gisted at

// this point), it means that the writes are not complete, or in other

// words, it is a Partial write and Partial writes must be rejected.

return AccMap.is_equal(PossibleTensor);

}
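
The idea behind `isCorrectAccessMap`, rebuilding the expected access from `Dimensions` and comparing it with the given one over the statement domain, can be sketched without isl. The version below is an illustration only: it enumerates a finite box domain instead of using the symbolic `is_equal`, and `Point`, `Access`, `makePermutationAccess` and `isCorrectAccessSketch` are hypothetical names.

```cpp
#include <cassert>
#include <functional>
#include <vector>

using Point = std::vector<int>;
using Access = std::function<Point(const Point &)>;

// Build Stmt[I0, ..., Im] -> T[I_{Dimensions[0]}, ..., I_{Dimensions[n]}].
Access makePermutationAccess(const std::vector<int> &Dimensions) {
  return [Dimensions](const Point &In) {
    Point Out;
    for (int InPos : Dimensions)
      Out.push_back(In[InPos]);
    return Out;
  };
}

// Compare Acc with the reconstructed access on every point of the box
// domain [0, Extent)^NumDims.
bool isCorrectAccessSketch(const Access &Acc,
                           const std::vector<int> &Dimensions,
                           unsigned NumDims, int Extent) {
  assert(Extent > 0 && "The domain must not be empty");
  for (int InPos : Dimensions)
    if (InPos < 0 || InPos >= static_cast<int>(NumDims))
      return false;
  Access Expected = makePermutationAccess(Dimensions);
  Point P(NumDims, 0);
  while (true) {
    if (Acc(P) != Expected(P))
      return false;
    // Advance P like an odometer; stop once every digit has wrapped around.
    unsigned D = 0;
    while (D < NumDims && ++P[D] == Extent) {
      P[D] = 0;
      ++D;
    }
    if (D == NumDims)
      return true;
  }
}
```

An access such as `Stmt[i][j][k] -> A[k][i]` passes with `Dimensions = {2, 0}`, while a shifted access such as `A[i-1][j+1]` differs from every pure permutation and is rejected.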

Meinersbur (Not Done): I like the idea of verifying the correctness by reconstructing and comparing to the original.

Maybe do it at the end to verify that the entire TCInfoTy is correct? On the other hand, an earlier failure would be better. What do you think?

gareevroman (Author, Done): Other parts of TCInfoTy are verified in isTCOperandAcc too. I think that it would be better to verify the related information in one place, as much and as early as possible.

Probably, the earlier failure would simplify debugging, since we know the exact form of the memory accesses and can rely on it. Additionally, performance can improve, since failing earlier avoids additional operations on sets.

/// Check whether the access represents the tensor contraction operand.

///

/// Check that the access relation @p AccMap has the form T[i1, …, in].

/// Obtained indexes i1, …, in, their sizes and their permutation are stored

/// into @p IndexSet, @p DimensionSizes, and @p Dimensions, respectively.

Meinersbur (Done):

/// Obtained indexes i1, …, in, their sizes and their permutation are stored
- /// into @p IndexSet, @p DimensionSizes, and @p Dimensions, respectively.
+ /// into @p IndexSet, @p DimensionSizes, and @p Dimensions, respectively.
///
/// @param Domain The domain of the access relation.

///

/// @param Domain The domain of the access relation.

/// @param AccMap The access relation to be checked.

/// @param IndexSet The subset of the input dimensions.

/// @param DimensionSizes Sizes of the input dimensions of @p Dimensions.

/// @param Dimensions The permutation of the subset of the input dimensions.

/// @return True if @p AccMap has the expected form and false,

/// otherwise.

static bool isTCOperandAcc(isl::set Domain, isl::map AccMap,

Meinersbur (Done):

std::set<int> &IndexSet,
- SmallVector<int, 30> &DimensionSizes,
- SmallVector<int, 30> &Dimensions) {
+ SmallVectorImpl<int> &DimensionSizes,
+ SmallVectorImpl<int> &Dimensions) {
isl::id Id = AccMap.get_tuple_id(isl::dim::out);

`SmallVectorImpl` is not specific to what the vector's small size is.

SmallDenseSet<int> &IndexSet,

SmallVectorImpl<int> &DimensionSizes,

SmallVectorImpl<int> &Dimensions) {

isl::id Id = AccMap.get_tuple_id(isl::dim::out);

const ScopArrayInfo *SAI = ScopArrayInfo::getFromId(Id);

assert(SAI && "AccMap should represent memory access");

// Fix values of output dimensions with respect to their positions.

// In the case of the tensor contraction, values of output dimensions are

// fixed and form a permutation of a subset of values of input dimensions.

Meinersbur (Done):

unsigned InDimNum = unsignedFromIslSize(CheckMap.dim(isl::dim::in));
- for (unsigned i = 0; i < InDimNum; i++) {
+ for (unsigned i : rangeIslSize(0, CheckMap.dim(isl::dim::in)))
isl::val Val = isl::manage(

Consider using `polly::rangeIslSize` for iterating over dimensions.

// For example, in the case of Stmt[i][j][k] -> A[k][i], which represents

// the operand of the tensor contraction, we get the following map by fixing

// the output dimensions Stmt[1][j][0] -> A[0][1].

Meinersbur (Done): This is a weird way to find out which indices map to what other index, I guess the equivalent of isMatMulOperandAcc. It requires that the dimension number is part of the AccMap's range, and it will fail if there is any expression (e.g. Stmt[i][j] -> A[i-1][j+1] matches the wrong dimensions), but at least there is the verification afterwards.

I am not sure I like this sort of cleverness; I'd rather expect some sort of introspection into the map's coefficients, but I also think this should work in nearly all relevant cases and should be safe due to the verification, so let's keep it.

However, please document this better, e.g. add an example of what is expected to happen.

// We store the permutation of the subset of the input dimensions {2, 0} into

// @p Dimensions.

// The obtained permutation and the isCorrectAccessMap function are used to

// check whether the access relation @p AccMap represents the tensor

// contraction operand. For example, in the case of

// Stmt[i][j][k] -> A[i-1][j+1], we get Stmt[1][0][k] -> A[0][1] and,

Meinersbur (Done): The "plain" functions are unfortunately not very robust, e.g. their result differs depending on the internal representation. I'd suggest getConstant (from ISLTools), but it only takes a pw_aff.

Could you extract uses of plain_get_val_if_fixed into such a function, and mark it as TODO to cope with it later?

// consequently, {1, 0}, which is rejected by isCorrectAccessMap,

// since it corresponds to Stmt[i][j][k] -> A[j][i].

isl::map CheckMap = isl::manage(AccMap.copy());

unsigned OutDimNum = unsignedFromIslSize(CheckMap.dim(isl::dim::out));

for (unsigned i = 0; i < OutDimNum; i++)

CheckMap = CheckMap.fix_si(isl::dim::out, i, i);

// Try to obtain the permutation and sizes of corresponding input dimensions.

Dimensions.assign(OutDimNum, -1);

for (unsigned i : rangeIslSize(0, CheckMap.dim(isl::dim::in))) {

isl::val Val = getConstant(CheckMap, isl::dim::in, i);

if (!Val.is_int())

continue;

int OutPos = -1;

llvm::APInt ValAPInt = APIntFromVal(Val);

Meinersbur (Done):

/// @return The set intersection.
- std::set<int> intersect(const std::set<int> &A, const std::set<int> &B) {
+ static std::set<int> intersect(const std::set<int> &A, const std::set<int> &B) {
std::set<int> Intersection;

Although already in an anon namespace, the other methods add `static` as well. I found it helps the compiler to warn if a static function is unused.

if (ValAPInt.isSignedIntN(32))

OutPos = ValAPInt.getSExtValue();

if ((OutPos < 0) || (OutPos >= static_cast<int>(OutDimNum)) ||

Meinersbur (Done): Do you know of `#include <llvm/ADT/SetOperations.h>`? Unfortunately, these modify one set rather than returning a new set.

IndexSet.count(i))

return false;

IndexSet.insert(i);

Dimensions[OutPos] = i;

if (DimensionSizes[i] <= 0)

DimensionSizes[i] = getDimSize(SAI, OutPos);

}

return isCorrectAccessMap(Domain, AccMap, Dimensions);

}
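
The two examples from the comment in `isTCOperandAcc` can be replayed on a small model of the "fix every output dimension to its own position" trick. This is a sketch under simplifying assumptions, not the isl-based code: each output dimension reads exactly one input dimension plus a constant, each input dimension is read at most once, and `AffineAccessSketch`, `inferDimensions` and `isPlainPermutation` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Output o of the access reads input dimension Src[o] plus the constant
// Off[o]. Stmt[i][j][k] -> A[k][i] is Src = {2, 0}, Off = {0, 0}.
struct AffineAccessSketch {
  std::vector<int> Src;
  std::vector<int> Off;
};

// Fixing output o to the value o pins input Src[o] to o - Off[o]. The pinned
// value of an input is then read back as the output position it feeds.
bool inferDimensions(const AffineAccessSketch &Acc, unsigned NumIn,
                     std::vector<int> &Dimensions) {
  const unsigned NumOut = static_cast<unsigned>(Acc.Src.size());
  assert(Acc.Off.size() == NumOut && "One offset per output dimension");
  std::vector<bool> IsPinned(NumIn, false);
  std::vector<int> Pinned(NumIn, 0);
  for (unsigned O = 0; O < NumOut; ++O) {
    IsPinned[Acc.Src[O]] = true;
    Pinned[Acc.Src[O]] = static_cast<int>(O) - Acc.Off[O];
  }
  Dimensions.assign(NumOut, -1);
  for (unsigned In = 0; In < NumIn; ++In) {
    if (!IsPinned[In])
      continue; // free input dimension, e.g. j in Stmt[i][j][k] -> A[k][i]
    int OutPos = Pinned[In];
    if (OutPos < 0 || OutPos >= static_cast<int>(NumOut))
      return false;
    Dimensions[OutPos] = static_cast<int>(In);
  }
  return true;
}

// Counterpart of the isCorrectAccessMap verification in this model: the
// access must equal the pure permutation described by Dimensions.
bool isPlainPermutation(const AffineAccessSketch &Acc,
                        const std::vector<int> &Dimensions) {
  if (Acc.Src != Dimensions)
    return false;
  for (int Offset : Acc.Off)
    if (Offset != 0)
      return false;
  return true;
}
```

For `A[k][i]` the model infers `{2, 0}` and the verification accepts it. For `A[i-1][j+1]` it infers `{1, 0}`, which describes `A[j][i]`, so the verification rejects it, as the comment above describes.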

/// Find the intersection of two sets.

///

/// Find the intersection of the set @p A and the set @p B.

///

/// @param A, B Sets to intersect.

/// @return The set intersection.

static SmallDenseSet<int> intersect(const SmallDenseSet<int> &A,

const SmallDenseSet<int> &B) {

SmallDenseSet<int> Intersection = A;

set_intersect(Intersection, B);

return Intersection;

}

/// Check whether the set is a superset.

///

/// Check that the set @p A is a superset of @p B.

///

/// @param A, B Sets to be checked.

Meinersbur (Done):

auto *MemA = Accesses.end() - 1;
- for (; MemA != Accesses.begin(); MemA--) {
+ for (MemoryAccess *MemA : reverse(Accesses)) {
MemoryAccess *MemAccessPtr = *MemA;

Meinersbur (Not Done): I can use JScopImport to set a scalar memory access to a partial write without adding additional dependencies; that is, I don't think this can be just ignored.

I suggest having a single function that calls getAccessesInOrder and sorts out which MemAccess is read/write in there, then analyzes them.

gareevroman (Author, Done): I am not sure whether modifications of implementations of tensor contractions, which contain read and write scalar memory accesses, are useful in practice.

Moreover, since bundles of induction variables I, J, P can contain an unlimited number of dimensions, we possibly cannot follow the algorithm from the containsOnlyMatrMultAcc function, which permutes dimensions and checks that additional memory accesses have stride 0 in terms of dimensions MMM.i, MMM.j, and MMM.k. Consequently, such memory accesses can be treated as scalar memory accesses. I have not come up with an effective alternative yet.

That is why I do not consider scalar memory accesses in getWriteAccess and setReadAccesses functions. Could we mark it as TODO and do it in the future?

Meinersbur (Not Done): The concern is that I can modify what `isLatestArrayKind()` returns by simply importing a JScop. The continue just ignores such weirdness but I think it is safer to fail in this case.

You yourself mention that scalar accesses are likely not useful, so why not fail when one is found instead (return null instead of continue)? Some exceptions may be possible, such as read-only scalars (VirtualUse::ReadOnly, VirtualUse::Synthesizable)

gareevroman (Author, Not Done): I've added such a return statement to avoid scalar write memory accesses. Sorry, I was wrong. We need scalar read memory accesses. For example, in the case of the following matrix-matrix multiplication, a SCoP statement, which represents the body of the loop, contains the constant alpha.

C = alpha*A*B

Could we accept non-partial scalar read memory accesses? I think this is legal.

/// @return True if the set A is a superset of B.

static bool isSuperset(const SmallDenseSet<int> &A,

const SmallDenseSet<int> &B) {

return intersect(A, B).size() == B.size();

}

/// Find the union of two sets.

///

/// Find the union of the set @p A and the set @p B.

///

/// @param A, B Sets to unite.

Meinersbur (Done): This computes whether two sets are disjoint; it should not be required to compute the intersection.

gareevroman (Author, Done): That check is redundant. Thanks.

/// @return The set union.

static SmallDenseSet<int> unite(const SmallDenseSet<int> &A,

const SmallDenseSet<int> &B) {

SmallDenseSet<int> Union = A;

set_union(Union, B);

return Union;

}
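
How the three helpers are meant to combine can be tried without LLVM. The sketch below assumes `std::set<int>` in place of `SmallDenseSet<int>` and hand-written loops in place of `set_intersect` and `set_union`; the alias `IndexSet` is introduced only for this example. The function names follow the patch.

```cpp
#include <cassert>
#include <set>

using IndexSet = std::set<int>;

// Elements that occur in both A and B.
IndexSet intersect(const IndexSet &A, const IndexSet &B) {
  IndexSet Result;
  for (int V : A)
    if (B.count(V))
      Result.insert(V);
  return Result;
}

// A is a superset of B exactly when intersecting with A keeps all of B.
bool isSuperset(const IndexSet &A, const IndexSet &B) {
  return intersect(A, B).size() == B.size();
}

// Elements that occur in A, in B, or in both.
IndexSet unite(const IndexSet &A, const IndexSet &B) {
  IndexSet Result = A;
  Result.insert(B.begin(), B.end());
  return Result;
}
```

With `J = {2}` and `P = {3}`, the condition written as `unite(J, P) == IndexSet` in the review holds for an operand indexed by `{2, 3}`.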

/// Determine the access that writes to the tensor, which contains

/// the result of the tensor contraction.

///

/// @param Domain The domain of the statement.

/// @param Stmt The statement, which writes to memory.

/// @param TCI The information about the tensor contraction.

/// @param IandJIndexSet The set, which contains free indexes of tensors.

/// @return The determined MemoryAccess, or nullptr if there is no necessary

Meinersbur (Done):

const std::set<int> &IandJIndexSet,
- SmallVector<int, 30> Dimensions, TCInfoTy &TCI) {
+ ArrayRef<int> Dimensions, TCInfoTy &TCI) {
if (!TCI.A) {

gareevroman (Author, Done): As far as I understand, we cannot do this here because of the assignment to TCI.ADimensions and TCI.BDimensions.

Meinersbur (Not Done): The assignments should just make a copy of the array. With `Dimensions` being passed by value, the caller has to make the copy, which it should not need to.

SmallVector has no overload for being assigned an ArrayRef, but you could use llvm::append_range to insert all the values.

gareevroman (Author, Done): I think we can use llvm::replace to avoid clearing the vector and preserve the logic.

/// access within the SCoP.

static MemoryAccess *getWriteAccess(isl::set Domain, ScopStmt *Stmt,

TCInfoTy &TCI,

SmallDenseSet<int> &IandJIndexSet) {

TCI.WriteToC = nullptr;

SmallVector<MemoryAccess *, 32> Accesses = getAccessesInOrder(*Stmt);

Meinersbur (Done): Introduce an `is_superset` (etc.) call?

for (MemoryAccess *MemA : reverse(Accesses)) {

// A TC-like pattern does not contain scalar write memory accesses.

if (!MemA->isLatestArrayKind())

return nullptr;

// The last memory access should be a write memory access.

if (!MemA->isWrite())

return nullptr;

isl::map AccMap = MemA->getLatestAccessRelation();

if (!isTCOperandAcc(Domain, AccMap, IandJIndexSet, TCI.DimensionSizes,

TCI.CDimensions))

return nullptr;

return MemA;

}

return nullptr;

}
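
As written, every path through the loop body in `getWriteAccess` returns, so only the last access in execution order is examined. A minimal model of that decision is sketched below. `AccessSketch` and `findWriteAccess` are hypothetical names, the two flags stand in for `isLatestArrayKind()` and `isWrite()`, and the additional `isTCOperandAcc` check on the access relation is left out.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the two MemoryAccess properties that are tested.
struct AccessSketch {
  bool IsArray;
  bool IsWrite;
};

// Return the index of the accepted write access, or -1 if the last access in
// execution order is a scalar access or a read.
int findWriteAccess(const std::vector<AccessSketch> &InOrder) {
  if (InOrder.empty())
    return -1;
  const AccessSketch &Last = InOrder.back();
  if (!Last.IsArray || !Last.IsWrite)
    return -1;
  return static_cast<int>(InOrder.size()) - 1;
}
```

A statement that reads A, B and C and then writes C is accepted at its last access; a statement that ends in a read or in a scalar write is rejected.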

/// Determine an access, which reads elements of an operand of the tensor

/// contraction

Meinersbur (Done): Could we add utility functions such that this becomes `unite(J, P) == IndexSet`?

///

/// @param MemAccessPtr The access, which reads elements of the tensor.

/// @param IndexSet The set, which contains indexes of the tensors.

/// @param IandJIndexSet The set, which contains free indexes of tensors.

/// @param Dimensions The permutation of the subset of the input dimensions.

/// @param TCI The information about the tensor contraction.

/// @return True if the memory access @p MemAccessPtr corresponds

/// to the tensor contraction.

static bool setReadAccess(MemoryAccess *MemAccessPtr,

const SmallDenseSet<int> &IndexSet,

const SmallDenseSet<int> &IandJIndexSet,

ArrayRef<int> Dimensions, TCInfoTy &TCI) {

Meinersbur (Done): `getAccessesInOrder` requires `Stmt` to not be a RegionStmt. Please add a test for it.

gareevroman (Author, Done): I've added a check to containsOnlyTCAcc. Could you clarify what the test case should look like? Should it be a region statement that contains a matrix multiplication with the right order of memory accesses?

Meinersbur (Not Done): The test in `containsOnlyTCAcc` is exactly what I was looking for. A region statement could look like this:

c = C[i][j];
if (/*non-affine condition*/) {
  (void)A[i][k] + B[k][j];
} else {
  C[i][j] = c;
}

which has the correct order of accesses but is obviously not what we are looking for.

gareevroman (Author, Done): Thanks for the example! I have added a corresponding test case. If I am not mistaken, it requires DeLICM.

Meinersbur (Not Done): It does not require DeLICM, but `-polly-allow-nonaffine-branches` (which is enabled by default)

gareevroman (Author, Not Done): If I'm not mistaken, in your example the form of the dependencies doesn't correspond to the pattern.

c = C[i][j];
if (/*non-affine condition*/) {
  A[i][k] + B[k][j];
} else {
  C[i][j] = c;
}

MayWrite: { Stmt_for_body8TOfor_inc[i0, i1, i2] -> MemRef_C[i0, i1] : 0 <= i0 <= 31 and 0 <= i1 <= 31 and 0 <= i2 <= 31 }

I've added a slightly modified version of it to polly/test/ScheduleOptimizer/pattern-matching-based-opts_23.ll. It produces a region statement too.

for (int i = 0; i < 32; i++)
  for (int j = 0; j < 32; j++)
    for (int k = 0; k < 32; k++) {
      int c = C[i][j];
      if (i*j*k < 10) {
        C[i][j] = A[i][k] + B[k][j];
      } else {
        C[i][j] = c;
      }
    }

However, it introduces store merge phi nodes. It makes DeLICM necessary.

Statements {
	Stmt_for_body8__TO__if_end
        Domain :=
            { Stmt_for_body8__TO__if_end[i0, i1, i2] : 0 <= i0 <= 31 and 0 <= i1 <= 31 and 0 <= i2 <= 31 };
        Schedule :=
            { Stmt_for_body8__TO__if_end[i0, i1, i2] -> [i0, i1, i2, 0] };
        ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
            { Stmt_for_body8__TO__if_end[i0, i1, i2] -> MemRef_A[i0, i2] };
        ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
            { Stmt_for_body8__TO__if_end[i0, i1, i2] -> MemRef_B[i2, i1] };
        ReadAccess :=	[Reduction Type: NONE] [Scalar: 0]
            { Stmt_for_body8__TO__if_end[i0, i1, i2] -> MemRef_C[i0, i1] };
        MustWriteAccess :=	[Reduction Type: NONE] [Scalar: 1]
            { Stmt_for_body8__TO__if_end[i0, i1, i2] -> MemRef_storemerge__phi[] };
       new: { Stmt_for_body8__TO__if_end[i0, i1, i2] -> MemRef_C[i0, i1] };
	Stmt_if_end
        Domain :=
            { Stmt_if_end[i0, i1, i2] : 0 <= i0 <= 31 and 0 <= i1 <= 31 and 0 <= i2 <= 31 };
        Schedule :=
            { Stmt_if_end[i0, i1, i2] -> [i0, i1, i2, 1] };
        ReadAccess :=	[Reduction Type: NONE] [Scalar: 1]
            { Stmt_if_end[i0, i1, i2] -> MemRef_storemerge__phi[] };
       new: { Stmt_if_end[i0, i1, i2] -> MemRef_C[i0, i1] };
        MustWriteAccess :=	[Reduction Type: NONE] [Scalar: 0]
            { Stmt_if_end[i0, i1, i2] -> MemRef_C[i0, i1] };

MeinersburUnsubmitted

Not Done

The storemerge PHI node is introduced by one of the canonicalization passes (in this case: InstCombine). It is possible to not run that pass, disable the matching of this particular pattern, or use some other trick to not be matched. In any case, we cannot rely on InstCombine having run. It might be safer to bail out if any RegionStmt is encountered.

gareevromanAuthorUnsubmitted

Done

I see. I haven't managed to fix that test case. So, I've decided to remove it. I've left the check that bails out if any RegionStmt is encountered in containsOnlyTCAcc.

if (!TCI.A) {

// Probably IndexSet is a union of I and P sets.

if (!isSuperset(IndexSet, TCI.P))

return false;

// Obtain the set I.

TCI.I = set_difference(IndexSet, TCI.P);

MeinersburUnsubmitted

Done

Compiler warning:

/home/meinersbur/src/llvm-project/polly/lib/Transform/MatmulOptimizer.cpp:1310:13: warning: moving a temporary object prevents copy elision [-Wpessimizing-move]
    TCI.I = std::move(set_difference(IndexSet, TCI.P));

The result of set_difference is already an r-value, no need to cast it to an r-value.
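A minimal standalone sketch of the point made in this comment. The helper `setDifference` and the function `computeI` are hypothetical stand-ins (using `std::set` instead of `SmallDenseSet`), not code from the patch. Because the helper returns by value, its result can be assigned directly, and wrapping the call in `std::move` adds nothing.

```cpp
#include <cassert>
#include <set>

// Hypothetical stand-in for set_difference: it returns its result by value,
// so a call to it is already an rvalue.
static std::set<int> setDifference(const std::set<int> &A,
                                   const std::set<int> &B) {
  std::set<int> Result;
  for (int X : A)
    if (B.count(X) == 0)
      Result.insert(X);
  return Result;
}

// Computes I = IndexSet \ P, assuming P is contained in IndexSet.
static std::set<int> computeI(const std::set<int> &IndexSet,
                              const std::set<int> &P) {
  for (int X : P)
    assert(IndexSet.count(X) != 0 && "P must be contained in IndexSet");
  std::set<int> I;
  // The temporary is moved into I by the move-assignment operator;
  // writing std::move(setDifference(...)) here would be redundant.
  I = setDifference(IndexSet, P);
  return I;
}
```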

if (!isSuperset(IandJIndexSet, TCI.I))

return false;

MeinersburUnsubmitted

Done

If any of the returns are executed, what causes the pattern to be rejected (it's not return false)?

// Obtain the set J.

TCI.J = set_difference(IandJIndexSet, TCI.I);

MeinersburUnsubmitted

Done

Same compiler warning.

// Set the first operand of the tensor contraction.

TCI.A = MemAccessPtr;

llvm::replace(TCI.ADimensions, TCI.ADimensions.begin(),

TCI.ADimensions.end(), Dimensions.begin(), Dimensions.end());

return true;

}

if (!TCI.B) {

// IndexSet should be a union of J and P sets.

if (unite(TCI.P, TCI.J) != IndexSet)

return false;

// Set the second operand of the tensor contraction.

TCI.B = MemAccessPtr;

llvm::replace(TCI.BDimensions, TCI.BDimensions.begin(),

TCI.BDimensions.end(), Dimensions.begin(), Dimensions.end());

return true;

}

return false;

}
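The index-set bookkeeping done by setReadAccess can be illustrated with a simplified model. This is only a sketch under assumptions: `splitBundles` and its helpers are hypothetical names, `std::set` replaces `SmallDenseSet`, and both operands are handled in one call instead of one access at a time. It derives I from the first operand's indexes and P, derives J from the indexes of the written tensor, and requires the second operand to use exactly P and J.

```cpp
#include <cassert>
#include <set>

using IndexSetTy = std::set<int>;

// True if every element of B is also in A.
static bool isSuperset(const IndexSetTy &A, const IndexSetTy &B) {
  for (int X : B)
    if (A.count(X) == 0)
      return false;
  return true;
}

// Elements of A that are not in B.
static IndexSetTy difference(const IndexSetTy &A, const IndexSetTy &B) {
  IndexSetTy Result;
  for (int X : A)
    if (B.count(X) == 0)
      Result.insert(X);
  return Result;
}

// AIdx and BIdx are the loop dimensions used by the two read operands,
// IandJ the dimensions used by the written tensor, and P the contraction
// dimensions. On success, I and J hold the free-index bundles.
static bool splitBundles(const IndexSetTy &AIdx, const IndexSetTy &BIdx,
                         const IndexSetTy &IandJ, const IndexSetTy &P,
                         IndexSetTy &I, IndexSetTy &J) {
  if (!isSuperset(AIdx, P)) // The first operand must be indexed by I and P.
    return false;
  I = difference(AIdx, P);
  if (!isSuperset(IandJ, I))
    return false;
  J = difference(IandJ, I);
  IndexSetTy PandJ = P;
  PandJ.insert(J.begin(), J.end());
  return PandJ == BIdx; // The second operand must use exactly P and J.
}
```

For C[i][j] += A[i][l][w] * B[w][j][l] with dimensions numbered i = 0, j = 1, l = 2, w = 3, the inputs are AIdx = {0, 2, 3}, BIdx = {1, 2, 3}, IandJ = {0, 1}, and P = {2, 3}.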

/// Check that all memory accesses of the statement, except for the last

/// one, are read memory accesses, which read elements of operands of the tensor

/// contraction and its result.

///

/// @param Domain The domain of the statement.

/// @param Stmt The statement, which writes to memory.

/// @param TCI The information about the tensor contraction.

/// @param IandJIndexSet The set, which contains free indexes of tensors.

/// @return True if all read memory accesses of the statement @p Stmt correspond

/// to the tensor contraction.

static bool setReadAccesses(isl::set Domain, ScopStmt *Stmt, TCInfoTy &TCI,

SmallDenseSet<int> &IandJIndexSet) {

TCI.A = nullptr;

TCI.B = nullptr;

TCI.ReadFromC = nullptr;

SmallVector<MemoryAccess *, 32> Accesses = getAccessesInOrder(*Stmt);

for (auto *MemA = Accesses.begin(); *MemA != TCI.WriteToC; MemA++) {

MemoryAccess *MemAccessPtr = *MemA;

// All memory accesses, except for the last one, should be read memory

// accesses.

if (MemAccessPtr->isWrite())

return false;

isl::map AccMap = MemAccessPtr->getLatestAccessRelation();

if (!MemAccessPtr->isLatestArrayKind()) {

// Check whether the scalar read memory access is not partial.

if (!Domain.is_subset(AccMap.domain()))

return false;

continue;

}

// There is only one memory access, which reads elements of the result of

// the tensor contraction.

if (AccMap.is_equal(TCI.WriteToC->getLatestAccessRelation())) {

if (TCI.ReadFromC)

return false;

TCI.ReadFromC = MemAccessPtr;

continue;

}

SmallVector<int> Dimensions;

SmallDenseSet<int> IndexSet;

MeinersburUnsubmitted

Done

/// and false, otherwise.

static bool isTcDep(isl::set DepDelta, unsigned Pos, isl::set BoundDeltas,

- std::set<int> *IndexSet) {

+ const std::set<int> &IndexSet) {

if ((unsignedFromIslSize(DepDelta.n_basic_set()) != 1) ||

if (!isTCOperandAcc(Domain, AccMap, IndexSet, TCI.DimensionSizes,

Dimensions))

return false;

MeinersburUnsubmitted

Done

The check should not depend on n_basic_set, which is fragile and depends on whether, e.g., coalesce was successful. Consider using something like polly::getConstant.

gareevromanAuthorUnsubmitted

Done

I think that this check was redundant. I’ve removed it.

if (!setReadAccess(MemAccessPtr, IndexSet, IandJIndexSet, Dimensions, TCI))

return false;

}

// Check that there are read memory accesses, which read elements of operands

// of the tensor contraction and its result.

MeinersburUnsubmitted

Done

return true;

}

- /// Check that dependency corresponds to the tensor contraction.

+ /// Check that dependency corresponds to the tensor contraction carried over loop dimension @p Pos.

///

/// Check that the dependency has the form

return TCI.ReadFromC && TCI.A && TCI.B;

}

/// Check accesses to operands of the tensor contraction.

///

/// Check that accesses of the SCoP statement, which corresponds to

/// the partial schedule @p PartialSchedule, represent accesses

/// to the non-scalar operands of the tensor contraction.

///

/// @param Domain The domain of the SCoP statement.

/// @param PartialSchedule The partial schedule of the SCoP statement.

/// @param TCI Parameters of the tensor contraction operands.

/// @return True if the corresponding SCoP statement

/// represents tensor contraction and false,

/// otherwise.

static bool containsOnlyTCAcc(isl::set Domain, isl::map PartialSchedule,

TCInfoTy &TCI) {

isl::id InputDimsId = PartialSchedule.get_tuple_id(isl::dim::in);

ScopStmt *Stmt = static_cast<ScopStmt *>(InputDimsId.get_user());

// In region statements, the order of memory accesses execution is not

MeinersburUnsubmitted

Done

This seems to determine whether there is a reduction (contraction) carried by loop number Pos. The function name could be more meaningful. Suggestion: isDepCarryingReductionOverDim (not nice, but "TcDep" can mean anything)

gareevromanAuthorUnsubmitted

Done

Could we name it isReductionCarriedOverDim? I think, in this case, we should rename the parameter Pos to Dim to make it more readable.

MeinersburUnsubmitted

Done

Sounds ok.

// predictable at compile-time.

MeinersburUnsubmitted

Done

static bool isTcDep(isl::set DepDelta, unsigned Pos, isl::set BoundDeltas,

- SmallDenseSet<int> *IndexSet) {

+ const SmallDenseSet<int> &IndexSet) {

// Check the difference between the image element and the domain element

Consider passing by const-reference.

if ((Stmt->size() <= 1) || Stmt->isRegionStmt())

return false;

MeinersburUnsubmitted

Done

plain_get_val_if_fixed is not really robust, as it depends on the internal representation, which can be different after, e.g., simplify.

Since this just checks for a specific value, the best would be to create a new set where all the fixed dimensions are that value (here: 1), and check whether DepDelta is a subset of it.
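The intent of the suggested subset test can be modeled without isl. In the sketch below, which is an illustration only, a delta set is an explicit list of integer points and `isSubsetOfFixedPrefix` is a hypothetical name. The answer depends only on which points are in the set, not on how the set happens to be represented.

```cpp
#include <cassert>
#include <vector>

using Point = std::vector<int>;

// True if every point of Deltas lies in the set
//   { x : x[0] = ... = x[Dim - 1] = 0 and x[Dim] = 1 },
// i.e. if Deltas is a subset of that set. Later coordinates are unconstrained.
static bool isSubsetOfFixedPrefix(const std::vector<Point> &Deltas,
                                  unsigned Dim) {
  for (const Point &D : Deltas) {
    assert(Dim < D.size() && "Dim must be a valid coordinate");
    for (unsigned i = 0; i < Dim; ++i)
      if (D[i] != 0)
        return false;
    if (D[Dim] != 1)
      return false;
  }
  return true;
}
```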

unsigned DimNum = unsignedFromIslSize(PartialSchedule.dim(isl::dim::in));

TCI.DimensionSizes.resize(DimNum);

SmallDenseSet<int> IandJIndexSet;

TCI.WriteToC = getWriteAccess(Domain, Stmt, TCI, IandJIndexSet);

if (!TCI.WriteToC)

return false;

if (intersect(IandJIndexSet, TCI.P).size() != 0)

return false;

if (!setReadAccesses(Domain, Stmt, TCI, IandJIndexSet))

return false;

MeinersburUnsubmitted

Done

Consider lexmin_pw_multi_aff/lexmax_pw_multi_aff

MeinersburUnsubmitted

Not Done

Why not BoundDeltas.subtract() instead of deltas?

gareevromanAuthorUnsubmitted

Done

As far as I understand, these operations are not equal.

deltas computes a set containing the differences between image elements and corresponding domain elements in the input. subtract computes a subtraction of sets.

For example, in the case of the following sets they compute the following:

BoundDeltas : {Stmt_for_body15[31, 31, 31, 31, 31, 31] }
isl::manage(isl_set_neg(DepDelta.copy())): {Stmt_for_body15[0, 0, 0, 0, 0, -1]}

BoundDeltas.subtract(isl::manage(isl_set_neg(DepDelta.copy()))) : {Stmt_for_body15[31, 31, 31, 31, 31, 31]}
deltas: {Stmt_for_body15[31, 31, 31, 31, 31, 32]}

BoundDeltas : {Stmt_for_body15[31, 31, 31, 31, 31, 31]}
isl::manage(isl_set_neg(DepDelta.copy())): {Stmt_for_body15[0, 0, 0, -1, 0, 31]}

BoundDeltas.subtract(isl::manage(isl_set_neg(DepDelta.copy()))) : {Stmt_for_body15[31, 31, 31, 31, 31, 31]}
deltas: {Stmt_for_body15[31, 31, 31, 32, 31, 0]}

This comment interferes with the comment about pw_multi_aff. Consequently, I replaced the usage of isl_map_deltas with operations on pw_multi_aff.
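The distinction drawn in this reply can be reproduced with a plain C++ model that does not use isl. The names `subtractSets` and `pairwiseDifference` are hypothetical. The first removes points, as a set subtraction does. The second computes coordinate-wise differences between points, which is how the deltas values in the example above arise.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

using Point = std::vector<int>;
using PointSet = std::set<Point>;

// Set subtraction: the points of A that are not points of B.
static PointSet subtractSets(const PointSet &A, const PointSet &B) {
  PointSet Result;
  for (const Point &P : A)
    if (B.count(P) == 0)
      Result.insert(P);
  return Result;
}

// Coordinate-wise differences a - b for every pair (a, b) from A x B.
static PointSet pairwiseDifference(const PointSet &A, const PointSet &B) {
  PointSet Result;
  for (const Point &PA : A)
    for (const Point &PB : B) {
      assert(PA.size() == PB.size() && "points must have equal dimension");
      Point Diff(PA.size());
      for (std::size_t k = 0; k < PA.size(); ++k)
        Diff[k] = PA[k] - PB[k];
      Result.insert(Diff);
    }
  return Result;
}
```

With the first pair of sets from the example, the subtraction leaves the bound point unchanged because the two points differ, while the coordinate-wise difference yields [31, 31, 31, 31, 31, 32].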

return true;

}

/// Check that dependency corresponds to the tensor contraction carried over

/// loop dimension @p Dim.

///

/// Check that the dependency has the form

/// S(..., ki, max(k(i + 1)), ..., max(kn), ...) ->

/// S(..., ki + 1, min(k(i + 1)), ..., min(kn), ...), where S is the SCoP

/// statement. For this purpose, we analyze the set @p DepDelta, which

/// represents the differences between image elements and domain elements of

/// the corresponding map.

///

/// @param DepDelta The set contains the differences between image elements

/// and corresponding domain elements of the map, which

/// represents the dependency.

/// @param Dim The position of the index ki.

/// @param BoundDeltas In the case of indexes of ki, the difference between

/// image elements and corresponding domain elements

/// corresponds to the difference between lexicographic

/// minimum and lexicographic maximum of the corresponding

/// dimension of the domain of the statement.

/// @param IndexSet Obtained indexes ki, which describe the dependency.

/// @return True if dependencies correspond to the tensor contraction

/// and false, otherwise.

static bool isReductionCarriedOverDim(isl::set DepDelta, unsigned Dim,

isl::pw_multi_aff BoundDeltas,

const SmallDenseSet<int> &IndexSet) {

isl::space Space = DepDelta.get_space();

isl::set Superset = isl::set::universe(Space);

for (unsigned i = 0; i < Dim; i += 1)

Superset = Superset.fix_si(isl::dim::set, i, 0);

Superset = Superset.fix_si(isl::dim::set, Dim, 1);

// Check that the difference between the image element and the domain element

// is equal to one in the case of the index ki. Image elements and

// corresponding domain elements should be equal in the case of positions,

// which are lower than the specified position.

if (!DepDelta.is_subset(Superset))

return false;

MeinersburUnsubmitted

Not Done

SmallDenseSet<int> *IndexSet, isl::set Domain) {

- isl::union_map Dep = D->getDependences(Dependences::TYPE_RAW);

- isl::union_map Red = D->getDependences(Dependences::TYPE_RED);

- if (!Red.is_null())

- Dep = Dep.unite(Red);

+ isl::union_map Dep = D->getDependences(Dependences::TYPE_RAW | Dependences::TYPE_RED);

isl::space DomainSpace = Schedule.get_space().domain();

Should we also check whether WAW, RAW dependences are incompatible?

gareevromanAuthorUnsubmitted

Done

As far as I understand, that is not necessary, because subsequently we check that the statement has the form C(shuffle(I, J)) = E(A(shuffle(I, P)), B(shuffle(P, J)), C(shuffle(I, J))), where E is an expression that contains reads from the tensors A, B, C, and an arbitrary number of reads from constants with respect to bundles I, J, and P.

I have added a comment that describes this.

"The form of anti and output dependencies is specified by the form of the SCoP statement, which is checked by subsequent analysis."

// Compute a set, which is used to analyze how values of

// the domain are related to the map that describes the dependency.

isl_pw_multi_aff *DepDeltaPW = isl_pw_multi_aff_from_set(DepDelta.copy());

BoundDeltas = BoundDeltas.add(isl::manage(DepDeltaPW));

isl_set *ComplementRawSet = isl_set_from_pw_multi_aff(BoundDeltas.release());

isl::set Complement = isl::manage(ComplementRawSet);

MeinersburUnsubmitted

Not Done

lexmin/lexmax can be expensive. Wrap into an IslMaxOperationsGuard?

gareevromanAuthorUnsubmitted

Done

What is the maximal number of computational steps we should use by default? I set it to 500000 according to DependenceInfo.cpp.

for (unsigned i : rangeIslSize(Dim + 1, DepDelta.dim(isl::dim::set))) {

MeinersburUnsubmitted

Done

You seem to assume a functional relationship from here on. If that's the case, you can keep the type a pw_multi_aff, which supports more functions that you may have missed, such as pw_multi_aff::add.

if (!IndexSet.count(i)) {

// Check the difference between the image element and the domain element

MeinersburUnsubmitted

Done

Consider using reverse(rangeIslSize(0,DeltasDimNum)) (from ISLTools.h).

// in the case of indexes, which do not describe the dependency.

if (DepDelta.plain_get_val_if_fixed(isl::dim::set, i).is_zero())

continue;

return false;

}

MeinersburUnsubmitted

Not Done

This is going to check whether each element out of Intersection is a contraction over dimension i. Don't we also need to check that every iteration out of the band i is contributing to that contraction?

gareevromanAuthorUnsubmitted

Done

Could you clarify what you mean by the band i? Are these the indexes ki, which describe the dependencies?

isTcDep checks that the dependency has the form

/// S(..., ki, max(k(i + 1)), ..., max(kn), ...) -> S(..., ki + 1, min(k(i + 1)), ..., min(kn), …)

MeinersburUnsubmitted

Not Done

There is a check Intersection.is_empty() which is going to detect if a dependency is completely missing. But what detects that only some of the dependencies are present? For example:

[p] -> { Stmt3[i0, i1, i2, i3] -> Stmt3[o0, o1, o2, o3] : .... and p != 0 }

{ Stmt3[i0, i1, i2, i3] -> Stmt3[o0, o1, o2, o3] : .... and i0 < 42 }

(assuming contracting over i=0)

isReductionCarriedOverDim doesn't seem to check whether the dependency is over the complete domain either.

gareevromanAuthorUnsubmitted

Not Done

Are dependencies determined by the form of memory accesses? In the isCorrectAccessMap function, we check that memory accesses aren't partial. Isn't it sufficient?

I've tried to check whether the dependency is over the complete domain though.

// In the case of other indexes, which describe the dependency,

// the difference between the image element and the domain element

// should be equal to the difference between lexicographic minimum and

// lexicographic maximum of the domain of the statement.

if (!Complement.plain_get_val_if_fixed(isl::dim::set, i).is_zero())

return false;

}

return true;

}
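The dependency form that isReductionCarriedOverDim looks for can be checked on concrete iterations with a small model. This is a sketch under assumptions: `reductionDelta` is a hypothetical helper, loop bounds are given as explicit per-dimension minima and maxima, and only the contraction dimensions advance. For the carrying dimension the delta is 1. For inner contraction dimensions the delta is min - max, so adding max - min gives zero, which is what the final check on Complement above relies on.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Point = std::vector<int>;

// Computes Delta = Next - It, where Next is the iteration that follows It
// when only the contraction dimensions advance in lexicographic order.
// Returns false if It is the last such iteration.
static bool reductionDelta(const Point &It, const Point &Lo, const Point &Hi,
                           const std::vector<bool> &IsContraction,
                           Point &Delta) {
  assert(It.size() == Lo.size() && It.size() == Hi.size() &&
         It.size() == IsContraction.size());
  // Find the innermost contraction dimension that can still be incremented.
  int Carrier = -1;
  for (int D = static_cast<int>(It.size()) - 1; D >= 0; --D)
    if (IsContraction[D] && It[D] < Hi[D]) {
      Carrier = D;
      break;
    }
  if (Carrier < 0)
    return false;
  Point Next = It;
  Next[Carrier] += 1;
  // Inner contraction dimensions wrap around from their maximum to their
  // minimum.
  for (int D = Carrier + 1; D < static_cast<int>(It.size()); ++D)
    if (IsContraction[D])
      Next[D] = Lo[D];
  Delta.assign(It.size(), 0);
  for (std::size_t D = 0; D < It.size(); ++D)
    Delta[D] = Next[D] - It[D];
  return true;
}
```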

/// Check whether dependencies are over the complete domain.

///

/// In the case of the tensor contraction, RAW, WAW, and WAR dependencies

/// have the form

/// S(..., ki, max(k(i + 1)), ..., max(kn), ...) ->

/// S(..., ki + 1, min(k(i + 1)), ..., min(kn), ...), where S is the SCoP

/// statement. Consequently, the domain of the dependencies

/// can be described as

/// Domain / Domain ∩ S(…, max(kn),…) ∩ S(…, max(k(i + 1)),…),

/// where Domain is the domain of the statement S.

///

/// For example, in the case of the following tensor contraction,

/// corresponding domains will have the following form.

///

/// An example of the tensor contraction:

/// for (i = 0; i < 1024; i++)

/// for (j = 0; j < 1024; j++)

/// for (l = 0; l < 64; ++l)

/// for (w = 0; w < 64; ++w)

/// C[i][j] += A[i][l][w] * B[w][j][l];

///

/// The domain of the statement:

/// { S[i0, i1, i2, i3] : i0 >= 0 and i0 <= 1023 and

/// i1 >= 0 and i1 <= 1023 and

/// i2 >= 0 and i2 <= 63 and

/// i3 >= 0 and i3 <= 63 }

///

/// The domain of the dependencies:

/// { S[i0, i1, i2, i3] : (i0 >= 0 and i0 <= 1023 and

/// i1 >= 0 and i1 <= 1023 and

/// i2 >= 0 and i2 <= 63 and

/// i3 >= 0 and i3 <= 62) or

/// (i3 = 63 and i0 >= 0 and i0 <= 1023 and

/// i1 >= 0 and i1 <= 1023 and

/// i2 >= 0 and i2 <= 62) }

///

/// @param Domain The domain of the statement.

/// @param DepsForStmt RAW and RED dependencies for the statement.

/// @param UpperBound The lexicographic maximum of the elements in

/// the @p Domain.

/// @param IndexSet Obtained indexes ki, which describe the dependencies.

/// @return True if dependencies are over the complete domain

/// and false, otherwise.

static bool areDepsOverCompleteDomain(isl::set Domain, isl::map DepsForStmt,

isl::pw_multi_aff UpperBound,

SmallDenseSet<int> &IndexSet) {

isl_set *UpperBoundRawSet = isl_set_from_pw_multi_aff(UpperBound.copy());

isl::set UpperBoundSet = isl::manage(UpperBoundRawSet);

isl::set DomainRed = isl::manage(Domain.copy());

for (const auto It : IndexSet) {

isl::val FixedVal = UpperBoundSet.plain_get_val_if_fixed(isl::dim::set, It);

if (FixedVal.is_nan())

return false;

DomainRed = isl::manage(

isl_set_fix_val(DomainRed.copy(), isl_dim_set, It, FixedVal.release()));

}

return DepsForStmt.domain().intersect(Domain).is_equal(

Domain.subtract(DomainRed));

}
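The relation between the statement domain and the domain of the dependencies can be confirmed by brute force on a small instance. The sketch below is an illustration, not the isl-based check. It assumes the statement C[i][j] += A[i][l][w] * B[w][j][l] with hypothetical small extents, records the source of every flow dependency on C by tracking the last writer of each element, and compares the result with the domain minus the points whose contraction indexes are at their maxima.

```cpp
#include <array>
#include <cassert>
#include <map>
#include <set>
#include <utility>

using Iter = std::array<int, 4>; // (i, j, l, w)

struct Extents {
  int NI, NJ, NL, NW;
};

// Sources of flow dependencies on C: for every read of C[i][j], record the
// iteration that last wrote that element.
static std::set<Iter> dependenceSources(const Extents &E) {
  std::set<Iter> Sources;
  std::map<std::pair<int, int>, Iter> LastWriter;
  for (int i = 0; i < E.NI; ++i)
    for (int j = 0; j < E.NJ; ++j)
      for (int l = 0; l < E.NL; ++l)
        for (int w = 0; w < E.NW; ++w) {
          std::pair<int, int> Cell(i, j);
          auto Found = LastWriter.find(Cell);
          if (Found != LastWriter.end())
            Sources.insert(Found->second);
          Iter It = {i, j, l, w};
          LastWriter[Cell] = It;
        }
  return Sources;
}

// The domain without the points where both contraction indexes l and w
// take their maximal values.
static std::set<Iter> domainMinusFixedMaxima(const Extents &E) {
  std::set<Iter> Result;
  for (int i = 0; i < E.NI; ++i)
    for (int j = 0; j < E.NJ; ++j)
      for (int l = 0; l < E.NL; ++l)
        for (int w = 0; w < E.NW; ++w) {
          if (l == E.NL - 1 && w == E.NW - 1)
            continue;
          Iter It = {i, j, l, w};
          Result.insert(It);
        }
  return Result;
}
```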

/// Check that dependencies correspond to the tensor contraction.

///

/// Check that there are only true dependencies of the form

/// S(..., ki, max(k(i + 1)), ..., max(kn), ...) ->

/// S(..., ki + 1, min(k(i + 1)), ..., min(kn), ...), where S is the SCoP

/// statement represented by @p Schedule. Such dependencies are produced by

/// the tensor contraction. Obtained indexes ki are stored into @p IndexSet.

///

/// The form of anti and output dependencies is specified implicitly by

/// the form of the SCoP statement, which is checked by subsequent analysis.

///

/// @param Schedule The schedule of the SCoP statement.

/// @param D The SCoP dependencies.

/// @param Domain The domain of the statement.

/// @param IndexSet Obtained indexes ki, which describe the dependencies.

/// @return True if dependencies correspond to the tensor contraction

/// and false, otherwise.

static bool containsOnlyTcDeps(isl::map Schedule, const Dependences *D,

SmallDenseSet<int> &IndexSet, isl::set Domain) {

IslMaxOperationsGuard MaxOpGuard(Schedule.ctx().get(), OptComputeOut);

isl::union_map Dep =

D->getDependences(Dependences::TYPE_RAW | Dependences::TYPE_RED);

isl::space DomainSpace = Schedule.get_space().domain();

isl::space Space = DomainSpace.map_from_domain_and_range(DomainSpace);

isl::map DepsForStmt = Dep.extract_map(Space);

isl::set DepDeltas = DepsForStmt.deltas();

isl::size DeltasDimNum = DepDeltas.dim(isl::dim::set);

isl::pw_multi_aff LowerBound = Domain.lexmin_pw_multi_aff();

isl::pw_multi_aff UpperBound = Domain.lexmax_pw_multi_aff();

isl::pw_multi_aff BoundDeltas = UpperBound.sub(LowerBound);

MeinersburUnsubmitted

Done

This is not part of the pattern detection, but the optimization. Could we move it to the patch that does the actual optimization?

for (int i : reverse(rangeIslSize(0, DeltasDimNum))) {

// In the case of the tensor contraction, the difference between image

// elements and domain elements lies on a hyperplane where a dimension

// has the fixed value one.

isl::set Intersection = DepDeltas.fix_si(isl::dim::set, i, 1);

if (Intersection.is_empty())

continue;

if (!isReductionCarriedOverDim(Intersection, i, BoundDeltas, IndexSet))

return false;

IndexSet.insert(i);

MeinersburUnsubmitted

Done

Could you describe here what those 4 accesses are?

DepDeltas = DepDeltas.subtract(Intersection);

}

// In the case of the tensor contraction, all dependencies should have

MeinersburUnsubmitted

Done

5. should not be necessary; any permutation of the surrounding loops can be valid. E.g.,

for (w = 0; w < 64; ++w)
  for (l = 0; l < 64; ++l)
    for (i = 0; i < 1024; i++)
      for (j = 0; j < 1024; j++)
           C[i][j] += A[i][l][w] * B[w][j][l];

yields the same result.
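A small self-contained check of this claim, assuming hypothetical tensor extents and integer elements so that the comparison is exact. The function names are illustrative. With floating-point elements, the permuted order also changes the order in which the products are summed into each C[i][j], so results may differ in rounding.

```cpp
#include <array>
#include <cassert>

constexpr int NI = 3, NJ = 4, NL = 2, NW = 5;
using TensorA = std::array<std::array<std::array<int, NW>, NL>, NI>; // A[i][l][w]
using TensorB = std::array<std::array<std::array<int, NL>, NJ>, NW>; // B[w][j][l]
using MatrixC = std::array<std::array<int, NJ>, NI>;                 // C[i][j]

// Fill the operands with arbitrary deterministic values.
static void fillOperands(TensorA &A, TensorB &B) {
  for (int i = 0; i < NI; ++i)
    for (int l = 0; l < NL; ++l)
      for (int w = 0; w < NW; ++w)
        A[i][l][w] = i + 2 * l - w;
  for (int w = 0; w < NW; ++w)
    for (int j = 0; j < NJ; ++j)
      for (int l = 0; l < NL; ++l)
        B[w][j][l] = w * j + l - 1;
}

// Loop order i, j, l, w.
static MatrixC contractIJLW(const TensorA &A, const TensorB &B) {
  MatrixC C{}; // zero-initialized
  for (int i = 0; i < NI; ++i)
    for (int j = 0; j < NJ; ++j)
      for (int l = 0; l < NL; ++l)
        for (int w = 0; w < NW; ++w)
          C[i][j] += A[i][l][w] * B[w][j][l];
  return C;
}

// Loop order w, l, i, j, as in the example above.
static MatrixC contractWLIJ(const TensorA &A, const TensorB &B) {
  MatrixC C{}; // zero-initialized
  for (int w = 0; w < NW; ++w)
    for (int l = 0; l < NL; ++l)
      for (int i = 0; i < NI; ++i)
        for (int j = 0; j < NJ; ++j)
          C[i][j] += A[i][l][w] * B[w][j][l];
  return C;
}
```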

Meinersbur: 5. should not be necessary; any permutation of the surrounding loops can be valid. Eg, ```…

// the previously described form.

MeinersburUnsubmitted

Done

Could you add a high-level description how the algorithm actually works? I.e. dependencies used to determine contraction dimensions, etc.

if ((unsignedFromIslSize(DeltasDimNum) == 0) || !DepDeltas.is_empty())

return false;

return areDepsOverCompleteDomain(Domain, DepsForStmt, UpperBound, IndexSet);

}

/// Check if the SCoP statement could probably be optimized with analytical

/// modeling.

///

/// containsTCInfoTy tries to determine whether the following conditions

MeinersburUnsubmitted

Not Done

Prefer Node.isa<isl::schedule_node_leaf>() (and then typed subclass: Node.as<isl_schedule_node_leaf>())

gareevromanAuthorUnsubmitted

Done

Could we factor out this condition into ScheduleTreeOptimizer::isPMOptimizableBandNode, since it is common to the isTCPattern and isMatrMultPattern functions? A new version of the patch shows what it could look like.

/// are true:

///

/// 1. The last memory access modeling an array, MA1, represents writing to

/// memory and has the form S(..., I, ..., J, ...) -> M(shuffle(I, J)),

/// where S is the SCoP statement under consideration and shuffle(I, J)

/// is a permutation of indexes of sets I and J.

/// 2. There are only true dependencies of the form

/// S(..., ki, max(k(i + 1)), ..., max(kn), ...) ->

MeinersburUnsubmitted

Done

This condition is effectively identical to the next

/// S(..., ki + 1, min(k(i + 1)), ..., min(kn), ...), where S is the SCoP

/// statement represented by @p Schedule and ki are indexes of the set P.

/// 3. SCoP contains an arbitrary number of reads from constants and only three

MeinersburUnsubmitted

Done

This constraint should not be intrinsic to the algorithm, but I agree it to be easier to handle for now.

gareevromanAuthorUnsubmitted

Done

Could we add a TODO comment for this?

MeinersburUnsubmitted

Done

Yes, that would be great.

gareevromanAuthorUnsubmitted

Done

Ok. I've left that TODO comment.

/// access relations, MA2, MA3, and MA4 that represent reading from memory

MeinersburUnsubmitted

Done

/// 3. SCoP contains an arbitrary number of reads from constants and only three

- /// access relations, MA2, MA3, and MA4 that epresent reading from memory

+ /// access relations, MA2, MA3, and MA4 that represent reading from memory

/// and have the form

[typo]

/// and have the form

/// S(..., I, ..., P, ...) -> M(shuffle(I, P)),

/// S(..., P, ..., J, ...) -> M(shuffle(J, P)),

MeinersburUnsubmitted

Done

[style] No Almost-Always-Auto in LLVM's coding style.

Meinersbur: [style] [[ https://llvm.org/docs/CodingStandards.html#use-auto-type-deduction-to-make-code-more…

/// S(...) -> M(shuffle(I, J)), respectively.

///

MeinersburUnsubmitted

Done

auto NodeType = isl_schedule_node_get_type(Node.get());

- // Check that all predecessors of the node contain all band nodes

+ // Check that all ancestors of the node contain all band nodes

// for the statement, which represents the tensor contraction.

gareevromanAuthorUnsubmitted

Done

Looks like I missed that. Sorry. I will fix it in the next patch.

/// @param PartialSchedule The PartialSchedule that contains a SCoP statement

/// to check.

/// @param D The SCoP dependencies.

MeinersburUnsubmitted

Not Done

This looks for the outermost node that is not a filter or band. Is it possible that while that outermost node is not a TC contraction, one of the inner ones might? What if the outermost node is a filter, looks like it would just return false in this case.

gareevromanAuthorUnsubmitted

Done

If I am not mistaken, this only checks that all band nodes, which represent the statement, are not split by filter nodes. This accepts a straightforward implementation of TC with or without DeLICM. For example,

domain: "{ Stmt_for_body8[i0, i1, i2] : 0 <= i0 <= 1599 and

                              0 <= i1 <= 1799 and
                              0 <= i2 <= 2199;
Stmt_for_body3[i0, i1] :      0 <= i0 <= 1599 and
                              0 <= i1 <= 1799;
Stmt_for_body3_last[i0, i1] : 0 <= i0 <= 1599 and
                              0 <= i1 <= 1799 }"

child:

sequence:
- filter: "{ Stmt_for_body3[i0, i1] }"
  child:
    schedule: "[{ Stmt_for_body3[i0, i1] -> [(i0)] }, { Stmt_for_body3[i0, i1] -> [(i1)] }]"
    permutable: 1
    coincident: [ 1, 1 ]
- filter: "{ Stmt_for_body3_last[i0, i1] }"
  child:
    schedule: "[{ Stmt_for_body3_last[i0, i1] -> [(i0)] }, { Stmt_for_body3_last[i0, i1] -> [(i1)] }]"
    permutable: 1
    coincident: [ 1, 1 ]
- filter: "{ Stmt_for_body8[i0, i1, i2] }"
  child:
    schedule: "[{ Stmt_for_body8[i0, i1, i2] -> [(i0)] },
                { Stmt_for_body8[i0, i1, i2] -> [(i1)] },
                { Stmt_for_body8[i0, i1, i2] -> [(i2)] }]"
    permutable: 1
    coincident: [ 1, 1, 0 ]

domain: "{ Stmt2[i0, i1, i2] : 0 <= i0 <= 31 and 0 <= i1 <= 31 and 0 <= i2 <= 31 }"
child:

schedule: "[{ Stmt2[i0, i1, i2] -> [(i0)] }, { Stmt2[i0, i1, i2] -> [(i1)] }, { Stmt2[i0, i1, i2] -> [(i2)] }]"
permutable: 1
coincident: [ 1, 1, 0 ]

Sorry, I have not committed an updated version of the optimization of TC to my github repo. However, I believe that, if this is the case, we can safely replace all such nodes.

+ auto NodeType = isl_schedule_node_get_type(Node.get());
+ while ((NodeType != isl_schedule_node_domain) &&
+ (NodeType != isl_schedule_node_filter)) {
+ assert((NodeType != isl_schedule_node_sequence) &&
+ L"Prevent the undefined behavior");
+ Node = Node.parent();
+ NodeType = isl_schedule_node_get_type(Node.get());
+ }
+ Node = Node.child(0);
+ Node = isl::manage(isl_schedule_node_cut(Node.release()));
+ return Node.insert_partial_schedule(Dimensions);

I think that the detection of more sophisticated implementations of TC is a possible goal of future research.

I have described this in the comment.

MeinersburUnsubmitted

Done

I think some info in the comment like "all surrounding band nodes are assumed to be part of the TC and must not be interleaved by filter nodes."

Since it is not checking for it, it seems to imply that all other nodes types are OK? (sequence, set, expansion, extension, marker). Maybe reject them too? (I think ignoring marker nodes might still be ok)

gareevromanAuthorUnsubmitted

Done

I think some info in the comment like "all surrounding band nodes are assumed to be part of the TC and must not be interleaved by filter nodes."

I've added it.

Since it is not checking for it, it seems to imply that all other nodes types are OK? (sequence, set, expansion, extension, marker). Maybe reject them too? (I think ignoring marker nodes might still be ok)

Sequence nodes could be necessary if DeLICM was applied. Please see the example inside isTCPattern. Yes, I think other types except for marker nodes should be rejected. Additionally, as a precaution, I propose to check that a filter node has only a sequence node and a domain node as its ancestors. I've updated the patch.
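
The acceptance rule described in this reply can be sketched outside of isl. The following is a minimal illustration, assuming the ancestor chain is given as a plain list of node kinds from the innermost band node up to the root; `NodeKind` and `ancestorsLookLikeTC` are hypothetical names and not part of the patch or of the isl API.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for isl_schedule_node_type (illustration only).
enum class NodeKind { Domain, Sequence, Filter, Band, Mark, Set, Extension };

// Chain lists node kinds from the innermost band node up to the root.
// Band and mark nodes are skipped; a filter node is accepted only when it
// sits directly below a sequence node that sits directly below the domain
// node; every other node kind is rejected.
bool ancestorsLookLikeTC(const std::vector<NodeKind> &Chain) {
  for (std::size_t I = 0; I < Chain.size(); ++I) {
    NodeKind Kind = Chain[I];
    if (Kind == NodeKind::Domain)
      return true;
    if (Kind == NodeKind::Filter)
      return I + 2 < Chain.size() && Chain[I + 1] == NodeKind::Sequence &&
             Chain[I + 2] == NodeKind::Domain;
    if (Kind != NodeKind::Band && Kind != NodeKind::Mark)
      return false;
  }
  // The chain ended without reaching a domain node.
  return false;
}
```

With this sketch, the DeLICM schedule (band, filter, sequence, domain) is accepted, while a band nested below a bare sequence or an extension node is rejected.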


/// @param TCI Parameters of the tensor contraction operands.

/// @param Domain The domain of the statement.

/// @return True if dependencies and memory accesses correspond to the tensor

/// contraction and false, otherwise.

static bool containsTCInfoTy(isl::map PartialSchedule, const Dependences *D,

TCInfoTy &TCI, isl::set Domain) {

if (!containsOnlyTcDeps(PartialSchedule, D, TCI.P, Domain))

return false;

// TODO: handle cases of scalar multiplication if needed.

if (TCI.P.size() == 0)

return false;

if (!containsOnlyTCAcc(Domain, PartialSchedule, TCI))

return false;

// TODO: handle cases of GEMV if needed.

if ((TCI.I.size() == 0) || (TCI.J.size() == 0))

return false;

return true;

}

/// Check if this node contains a partial schedule that could

/// probably be optimized with analytical modeling.

///

/// isTCPattern is used to determine whether the SCoP represents a TC-like

/// kernel [1], which is a perfectly nested set of loops, with a data usage

/// pattern that is similar to that produced by the tensor contraction.

///

/// A TC-like kernel can be defined as follows:

///

/// 1. It satisfies the requirements of the polyhedral model.

/// 2. Without loss of generality, it contains three nonempty bundles of

/// one-dimensional for-loops with induction variables that are grouped into

/// bundles I = i0...i(r-1), J = j0..j(s-1), and P = p0...p(t-1), and they

/// are incremented by one.

/// 3. The innermost loop body can be represented as a statement of the form

/// C(shuffle(I, J)) = E(A(shuffle(I, P)), B(shuffle(P, J)),

/// C(shuffle(I, J))), where A(shuffle(I, P)), B(shuffle(P, J)),

/// C(shuffle(I, J)) are accesses to tensors A, B, C, respectively,

/// shuffle(I, J), shuffle(I, P), and shuffle(P, J) are permutations of the

/// enclosed indices, and E is an expression that contains reads from

/// the tensors A, B, C, and an arbitrary number of reads from constants

/// with respect to bundles I, J, and P.

///

/// TC can be considered as a particular case of a TC-like kernel.

///

/// The order of loops with indexes from P should be preserved. Otherwise,

/// isTCPattern should check if a commutative operation is used.

///

/// isTCPattern performs the following steps to check whether the SCoP

/// corresponds to a definition of a TC-like kernel:

///

/// 1. Checks that the node is the innermost band node.

/// 2. Checks that the partial schedule contains only one statement.

/// 3. Check that all ancestors of the node contain all band nodes for

/// the statement and only mark nodes interleave such band nodes. This

/// corresponds to a straightforward implementation of TC.

/// 4. Analyses the dependencies to determine contraction dimensions.

/// 5. Check that the last memory access modeling an array, represents writing

/// to the result of the TC-like kernel.

/// 6. Check that SCoP contains only three access relations that represent

/// reading of the operands of the TC-like kernel and an arbitrary number of

/// reads from constants.

///

/// [1] - Gareev R., Grosser T., Kruse M. High-Performance Generalized Tensor

/// Operations: A Compiler-Oriented Approach // ACM Transactions on

/// Architecture and Code Optimization (TACO). 2018.

/// Vol. 15, no. 3. P. 34:1–34:27. DOI: 10.1145/3235029.

///

/// If this is the case, we could logically represent tensors as matrices and

/// apply algorithms, which are used to get close-to-peak performance of

Meinersbur

Done

What is Goto here? GotoBLAS?


/// matrix multiplications in manually tuned BLAS libraries (e.g., BLIS).

///

/// @param Node The node to check.

/// @param D The SCoP dependencies.

/// @param TCI Parameters of the tensor contraction operands.

static bool isTCPattern(isl::schedule_node Node, const Dependences *D,

TCInfoTy &TCI) {

Node = Node.child(0);

isl::union_map PartialSchedule = Node.get_prefix_schedule_union_map();

isl::union_set Domain = Node.domain();

Node = Node.parent();

// The partial schedule should contain only one statement.

// TODO: This constraint should not be intrinsic to the algorithm.

if (isl_union_set_n_set(Domain.get()) != 1)

return false;

isl_schedule_node_type NodeType = isl_schedule_node_get_type(Node.get());

// Check that all ancestors of the node contain all band nodes for

// the statement, which represents the TC-like kernel, and only mark nodes

// interleave such band nodes. This corresponds to a straightforward

// implementation of TC with/without DeLICM applied.

// For example, this covers the matrix multiplication pattern after a full

// run of -polly-optree and -polly-delicm, where the write access is not

// through the original memory access, but through a PHI node that was

// delicmed. Subsequently, such band nodes will be replaced by a single band

// node.

// The corresponding schedule can be the following, where Stmt_for_body8

// contains the matrix multiplication:

// domain: "{ Stmt_for_body8[i0, i1, i2] : 0 <= i0 <= 1599 and

// 0 <= i1 <= 1799 and

// 0 <= i2 <= 2199;

// Stmt_for_body3[i0, i1] : 0 <= i0 <= 1599 and

// 0 <= i1 <= 1799;

// Stmt_for_body3_last[i0, i1] : 0 <= i0 <= 1599 and

// 0 <= i1 <= 1799 }"

// child:

// sequence:

// - filter: "{ Stmt_for_body3[i0, i1] }"

// child:

// schedule: "[{ Stmt_for_body3[i0, i1] -> [(i0)] },

// { Stmt_for_body3[i0, i1] -> [(i1)] }]"

// permutable: 1

// coincident: [ 1, 1 ]

// - filter: "{ Stmt_for_body3_last[i0, i1] }"

// child:

// schedule: "[{ Stmt_for_body3_last[i0, i1] -> [(i0)] },

// { Stmt_for_body3_last[i0, i1] -> [(i1)] }]"

// permutable: 1

// coincident: [ 1, 1 ]

// - filter: "{ Stmt_for_body8[i0, i1, i2] }"

// child:

// schedule: "[{ Stmt_for_body8[i0, i1, i2] -> [(i0)] },

// { Stmt_for_body8[i0, i1, i2] -> [(i1)] },

// { Stmt_for_body8[i0, i1, i2] -> [(i2)] }]"

// permutable: 1

// coincident: [ 1, 1, 0 ]

while (NodeType != isl_schedule_node_domain) {

if (NodeType == isl_schedule_node_filter) {

if (!Node.parent().isa<isl::schedule_node_sequence>() ||

!Node.parent().parent().isa<isl::schedule_node_domain>())

return false;

break;

}

if ((NodeType != isl_schedule_node_band) &&

(NodeType != isl_schedule_node_mark))

return false;

Node = Node.parent();

NodeType = isl_schedule_node_get_type(Node.get());

}

isl::map PartialScheduleMap = isl::map::from_union_map(PartialSchedule);

if (containsTCInfoTy(PartialScheduleMap, D, TCI, isl::set(Domain)))

return true;

return false;

}

} // namespace

isl::schedule_node

polly::tryOptimizeMatMulPattern(isl::schedule_node Node,

const llvm::TargetTransformInfo *TTI,

const Dependences *D) {

+ TCInfoTy TCI;

+ if (PMBasedTCOpts && isTCPattern(Node, D, TCI))

+ LLVM_DEBUG(dbgs() << "The tensor contraction pattern was detected\n");

MatMulInfoTy MMI;

- if (isMatrMultPattern(Node, D, MMI)) {

+ if (PMBasedMMMOpts && isMatrMultPattern(Node, D, MMI)) {

LLVM_DEBUG(dbgs() << "The matrix multiplication pattern was detected\n");

return optimizeMatMulPattern(Node, TTI, MMI);

}

return {};

}
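
The definition of a TC-like kernel in the isTCPattern comment groups the loop indices into the bundles I, J, and P. A minimal sketch of that grouping follows, assuming indices are compared by name only; the patch itself derives the bundles from dependences and access relations, and `Indices`, `Bundles`, and `classifyIndices` are hypothetical names used only for this illustration.

```cpp
#include <algorithm>
#include <string>
#include <vector>

using Indices = std::vector<std::string>;

// The three bundles from the TC-like kernel definition.
struct Bundles {
  Indices I, J, P;
};

static bool containsIndex(const Indices &V, const std::string &S) {
  return std::find(V.begin(), V.end(), S) != V.end();
}

// Classify every loop index of C(...) += A(...) * B(...):
//   I: used by C and A only, J: used by C and B only, P: used by A and B only.
// Returns false if some index fits no bundle or if any bundle is empty, which
// mirrors the rejection of scalar multiplication and GEMV in the patch.
bool classifyIndices(const Indices &Loops, const Indices &C, const Indices &A,
                     const Indices &B, Bundles &Out) {
  Out = Bundles{};
  for (const std::string &Idx : Loops) {
    bool InC = containsIndex(C, Idx);
    bool InA = containsIndex(A, Idx);
    bool InB = containsIndex(B, Idx);
    if (InC && InA && !InB)
      Out.I.push_back(Idx);
    else if (InC && InB && !InA)
      Out.J.push_back(Idx);
    else if (InA && InB && !InC)
      Out.P.push_back(Idx);
    else
      return false;
  }
  return !Out.I.empty() && !Out.J.empty() && !Out.P.empty();
}
```

For the statement C[i][j][k][w] += A[i][l][j][q] * B[q][w][l][k] used in one of the tests, this sketch yields I = (i, j), J = (k, w), and P = (l, q).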

polly/lib/Transform/ScheduleOptimizer.cpp

...
private:
/// Check if this node is a band node we want to tile.
///
/// We look for innermost band nodes where individual dimensions are marked as
/// permutable.
///
/// @param Node The node to check.
static bool isTileableBandNode(isl::schedule_node Node);

+ /// Check if this node is a band node we want to transform using pattern
+ /// matching.
+ ///
+ /// We look for innermost band nodes where individual dimensions are marked as
+ /// permutable. There is no restriction on the number of individual
+ /// dimensions.
+ ///
+ /// @param Node The node to check.
+ static bool isPMOptimizableBandNode(isl::schedule_node Node);

/// Pre-vectorizes one scheduling dimension of a schedule band.
///
/// prevectSchedBand splits out the dimension DimToVectorize, tiles it and
/// sinks the resulting point loop.
///
/// Example (DimToVectorize=0, VectorWidth=4):
///
/// | Before transformation:
...
if (isl_schedule_node_get_type(Child.get()) != isl_schedule_node_filter)
return false;
if (isl_schedule_node_get_type(Child.child(0).get()) !=
isl_schedule_node_leaf)
return false;
}
return true;
}

- bool ScheduleTreeOptimizer::isTileableBandNode(isl::schedule_node Node) {
+ /// Check if this node is a band node, which has only one child.
+ ///
+ /// @param Node The node to check.
+ static bool isOneTimeParentBandNode(isl::schedule_node Node) {
if (isl_schedule_node_get_type(Node.get()) != isl_schedule_node_band)
return false;

if (isl_schedule_node_n_children(Node.get()) != 1)
return false;

+ return true;
+ }
+
+ bool ScheduleTreeOptimizer::isTileableBandNode(isl::schedule_node Node) {
+ if (!isOneTimeParentBandNode(Node))
+ return false;
+
if (!isl_schedule_node_band_get_permutable(Node.get()))
return false;

auto Space = isl::manage(isl_schedule_node_band_get_space(Node.get()));

if (unsignedFromIslSize(Space.dim(isl::dim::set)) <= 1u)
return false;

Meinersbur

Done

Instead of modifying the idea of whether a node is tileable, consider introducing another constraint-checking function, as we should have done with prevectorization as well.

return isSimpleInnermostBand(Node);
}

+ bool ScheduleTreeOptimizer::isPMOptimizableBandNode(isl::schedule_node Node) {
+ if (!isOneTimeParentBandNode(Node))
+ return false;
+
+ return Node.child(0).isa<isl::schedule_node_leaf>();
+ }

__isl_give isl::schedule_node
ScheduleTreeOptimizer::applyTileBandOpt(isl::schedule_node Node) {
if (FirstLevelTiling) {
Node = tileNode(Node, "1st level tiling", FirstLevelTileSizes,
FirstLevelDefaultTileSize);
FirstLevelTileOpts++;
}

...
__isl_give isl_schedule_node *
ScheduleTreeOptimizer::optimizeBand(__isl_take isl_schedule_node *NodeArg,
void *User) {
const OptimizerAdditionalInfoTy *OAI =
static_cast<const OptimizerAdditionalInfoTy *>(User);
assert(OAI && "Expecting optimization options");

isl::schedule_node Node = isl::manage(NodeArg);
- if (!isTileableBandNode(Node))
- return Node.release();
-
- if (OAI->PatternOpts) {
+ if (OAI->PatternOpts && isPMOptimizableBandNode(Node)) {
isl::schedule_node PatternOptimizedSchedule =
tryOptimizeMatMulPattern(Node, OAI->TTI, OAI->D);
if (!PatternOptimizedSchedule.is_null()) {
MatMulOpts++;
OAI->DepsChanged = true;
return PatternOptimizedSchedule.release();
}
}

+ if (!isTileableBandNode(Node))
+ return Node.release();
+
if (OAI->Postopts)
Node = applyTileBandOpt(Node);

if (OAI->Prevect) {
// FIXME: Prevectorization requirements are different from those checked by
// isTileableBandNode.
Node = applyPrevectBandOpt(Node);
}
...
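
The effect of checking isPMOptimizableBandNode before isTileableBandNode can be illustrated with a toy model. This is only a sketch under stated assumptions: `BandSummary` and its fields are hypothetical, and `SimpleInnermost` stands in for the result of isSimpleInnermostBand, whose body is not shown here.

```cpp
// Hypothetical summary of a schedule node (illustration only, not isl).
struct BandSummary {
  bool IsBand;          // node type is a band
  unsigned NumChildren; // number of children of the node
  bool Permutable;      // the band is marked permutable
  unsigned NumDims;     // number of band dimensions
  bool ChildIsLeaf;     // the only child is a leaf node
  bool SimpleInnermost; // stand-in for isSimpleInnermostBand(Node)
};

// Mirrors isOneTimeParentBandNode: a band node with exactly one child.
bool isOneTimeParentBand(const BandSummary &B) {
  return B.IsBand && B.NumChildren == 1;
}

// Mirrors isTileableBandNode: additionally permutable, more than one
// dimension, and a simple innermost band.
bool isTileable(const BandSummary &B) {
  return isOneTimeParentBand(B) && B.Permutable && B.NumDims > 1 &&
         B.SimpleInnermost;
}

// Mirrors isPMOptimizableBandNode: only requires that the child is a leaf.
bool isPMOptimizable(const BandSummary &B) {
  return isOneTimeParentBand(B) && B.ChildIsLeaf;
}
```

Under these assumptions, a one-dimensional innermost band followed by a leaf passes the pattern-matching check but not the tiling check, which is why the order of the two checks in optimizeBand matters for schedules whose innermost band has a single dimension.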

polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm.ll

; RUN: opt %loadPolly \

; RUN: -polly-pattern-matching-based-opts=true \

; RUN: -polly-optree -polly-delicm -polly-simplify \

- ; RUN: -polly-opt-isl -debug < %s 2>&1 \

+ ; RUN: -polly-opt-isl -polly-tc-opt=true -debug < %s 2>&1 \

; RUN: | FileCheck %s

; REQUIRES: asserts

; Check that the pattern matching detects the matrix multiplication pattern

; after a full run of -polly-optree and -polly-delicm, where the write access

; is not through the original memory access, but through a PHI node that was

; delicmed. This test covers the polybench 2mm and 3mm cases.

;

; This test case generates the following schedule, which contains filters:

Meinersbur

Done

; delicmed. This test covers the polybench 2mm and 3mm cases.

;

- ; This test case generates the following schedule, which contans filters:

+ ; This test case generates the following schedule, which contains filters:

;

; domain: "{ Stmt_for_body8[i0, i1, i2] : 0 <= i0 <= 1599 and

Meinersbur:

;

; domain: "{ Stmt_for_body8[i0, i1, i2] : 0 <= i0 <= 1599 and

; 0 <= i1 <= 1799 and

; 0 <= i2 <= 2199;

; Stmt_for_body3[i0, i1] : 0 <= i0 <= 1599 and

; 0 <= i1 <= 1799;

; Stmt_for_body3_last[i0, i1] : 0 <= i0 <= 1599 and

; 0 <= i1 <= 1799 }"

; child:

; sequence:

; - filter: "{ Stmt_for_body3[i0, i1] }"

; child:

; schedule: "[{ Stmt_for_body3[i0, i1] -> [(i0)] }, { Stmt_for_body3[i0, i1] -> [(i1)] }]"

; permutable: 1

; coincident: [ 1, 1 ]

; - filter: "{ Stmt_for_body3_last[i0, i1] }"

; child:

; schedule: "[{ Stmt_for_body3_last[i0, i1] -> [(i0)] }, { Stmt_for_body3_last[i0, i1] -> [(i1)] }]"

; permutable: 1

; coincident: [ 1, 1 ]

; - filter: "{ Stmt_for_body8[i0, i1, i2] }"

; child:

; schedule: "[{ Stmt_for_body8[i0, i1, i2] -> [(i0)] },

; { Stmt_for_body8[i0, i1, i2] -> [(i1)] },

; { Stmt_for_body8[i0, i1, i2] -> [(i2)] }]"

; permutable: 1

; coincident: [ 1, 1, 0 ]

;

; CHECK: The tensor contraction pattern was detected

; CHECK: The matrix multiplication pattern was detected

;

target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

target triple = "x86_64-unknown-linux-gnu"

; Function Attrs: norecurse nounwind uwtable

define void @kernel_2mm(i32 %ni, i32 %nj, i32 %nk, i32 %nl, double %alpha, double %beta, [1800 x double]* nocapture %tmp, [2200 x double]* nocapture readonly %A, [1800 x double]* nocapture readonly %B, [2400 x double]* nocapture readnone %C, [2400 x double]* nocapture readnone %D) local_unnamed_addr #0 {


polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm_2.ll

This file was added.

; RUN: opt %loadPolly -polly-delicm -polly-simplify -polly-opt-isl \

; RUN: -polly-pattern-matching-based-opts=true \

; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s

; REQUIRES: asserts

Meinersbur

Not Done

; RUN: opt %loadPolly -polly-delicm -polly-simplify -polly-opt-isl \

; RUN: -polly-pattern-matching-based-opts=true \

- ; RUN: -polly-tc-opt=true -debug < %s 2>&1 | FileCheck %s

+ ; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s

; REQUIRES: asserts

Since this is not FileCheck-ing the LLVM-IR output, suppress it with -disable-output


gareevroman (Author)

Done

Could we fix the existing test cases in a separate patch?

polly/test/ScheduleOptimizer/pattern-matching-based-opts-after-delicm_2.ll

; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
; REQUIRES: asserts

polly/test/ScheduleOptimizer/pattern-matching-based-opts_16.ll

; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s

polly/test/ScheduleOptimizer/pattern-matching-based-opts_17.ll

; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s

polly/test/ScheduleOptimizer/pattern-matching-based-opts_18.ll

; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s

polly/test/ScheduleOptimizer/pattern-matching-based-opts_19.ll

; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s

polly/test/ScheduleOptimizer/pattern-matching-based-opts_20.ll

; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s


Meinersbur

Done

👍


gareevroman (Author)

Done

Ok. I've added this to my TODO list.


;

; Check that the pattern matching detects the tensor contraction pattern

; after a full run of -polly-delicm. This test case generates the following

; schedule, which contains two band nodes. Without DeLICM two statements are

; generated.

;

; domain: "{ Stmt5[i0, i1, i2, i3, i4, i5] : 0 <= i0 <= 31 and 0 <= i1 <= 31 and

; 0 <= i2 <= 31 and 0 <= i3 <= 31 and

; 0 <= i4 <= 31 and 0 <= i5 <= 31 }"

; child:

; schedule: "[{ Stmt5[i0, i1, i2, i3, i4, i5] -> [(i0)] },

; { Stmt5[i0, i1, i2, i3, i4, i5] -> [(i1)] },

; { Stmt5[i0, i1, i2, i3, i4, i5] -> [(i2)] },

; { Stmt5[i0, i1, i2, i3, i4, i5] -> [(i4)] },

; { Stmt5[i0, i1, i2, i3, i4, i5] -> [(i3)] }]"

; permutable: 1

; coincident: [ 1, 1, 1, 1, 0 ]

; child:

; schedule: "[{ Stmt5[i0, i1, i2, i3, i4, i5] -> [(i5)] }]"

; permutable: 1

; child:

; leaf

;

; for (i = 0; i < 32; i++)

; for (j = 0; j < 32; j++)

; for (k = 0; k < 32; ++k)

; for (l = 0; l < 32; ++l)

; for (w = 0; w < 32; ++w)

; for (q = 0; q < 32; ++q)

; C[i][j][k][w] += A[i][l][j][q] * B[q][w][l][k];

;

; CHECK: The tensor contraction pattern was detected

;

target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"

target triple = "x86_64-unknown-linux-gnu"

define internal fastcc void @kernel_tc([32 x [32 x [32 x double]]]* nocapture %C, [32 x [32 x [32 x double]]]* nocapture readonly %A, [32 x [32 x [32 x double]]]* nocapture readonly %B) {

entry:

br label %for.cond1.preheader

for.cond1.preheader: ; preds = %for.inc50, %entry

%indvars.iv19 = phi i64 [ 0, %entry ], [ %indvars.iv.next20, %for.inc50 ]

br label %for.cond4.preheader

for.cond4.preheader: ; preds = %for.inc47, %for.cond1.preheader

%indvars.iv16 = phi i64 [ 0, %for.cond1.preheader ], [ %indvars.iv.next17, %for.inc47 ]

br label %for.cond7.preheader

for.cond7.preheader: ; preds = %for.inc44, %for.cond4.preheader

%indvars.iv13 = phi i64 [ 0, %for.cond4.preheader ], [ %indvars.iv.next14, %for.inc44 ]

br label %for.cond10.preheader

for.cond10.preheader: ; preds = %for.inc41, %for.cond7.preheader

%indvars.iv10 = phi i64 [ 0, %for.cond7.preheader ], [ %indvars.iv.next11, %for.inc41 ]

br label %for.cond13.preheader

for.cond13.preheader: ; preds = %for.inc38, %for.cond10.preheader

%indvars.iv7 = phi i64 [ 0, %for.cond10.preheader ], [ %indvars.iv.next8, %for.inc38 ]

%arrayidx37 = getelementptr inbounds [32 x [32 x [32 x double]]], [32 x [32 x [32 x double]]]* %C, i64 %indvars.iv19, i64 %indvars.iv16, i64 %indvars.iv13, i64 %indvars.iv7

%.pre = load double, double* %arrayidx37, align 8

br label %for.body15

for.body15: ; preds = %for.body15, %for.cond13.preheader

%i = phi double [ %.pre, %for.cond13.preheader ], [ %add, %for.body15 ]

%indvars.iv = phi i64 [ 0, %for.cond13.preheader ], [ %indvars.iv.next, %for.body15 ]

%arrayidx21 = getelementptr inbounds [32 x [32 x [32 x double]]], [32 x [32 x [32 x double]]]* %A, i64 %indvars.iv19, i64 %indvars.iv10, i64 %indvars.iv16, i64 %indvars.iv

%i1 = load double, double* %arrayidx21, align 8

%arrayidx29 = getelementptr inbounds [32 x [32 x [32 x double]]], [32 x [32 x [32 x double]]]* %B, i64 %indvars.iv, i64 %indvars.iv7, i64 %indvars.iv10, i64 %indvars.iv13

%i2 = load double, double* %arrayidx29, align 8

%mul = fmul fast double %i2, %i1

%add = fadd fast double %i, %mul

store double %add, double* %arrayidx37, align 8

%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1

%exitcond.not = icmp eq i64 %indvars.iv.next, 32

br i1 %exitcond.not, label %for.inc38, label %for.body15

for.inc38: ; preds = %for.body15

%indvars.iv.next8 = add nuw nsw i64 %indvars.iv7, 1

%exitcond9.not = icmp eq i64 %indvars.iv.next8, 32

br i1 %exitcond9.not, label %for.inc41, label %for.cond13.preheader

for.inc41: ; preds = %for.inc38

%indvars.iv.next11 = add nuw nsw i64 %indvars.iv10, 1

%exitcond12.not = icmp eq i64 %indvars.iv.next11, 32

br i1 %exitcond12.not, label %for.inc44, label %for.cond10.preheader

for.inc44: ; preds = %for.inc41

%indvars.iv.next14 = add nuw nsw i64 %indvars.iv13, 1

%exitcond15.not = icmp eq i64 %indvars.iv.next14, 32

br i1 %exitcond15.not, label %for.inc47, label %for.cond7.preheader

for.inc47: ; preds = %for.inc44

%indvars.iv.next17 = add nuw nsw i64 %indvars.iv16, 1

%exitcond18.not = icmp eq i64 %indvars.iv.next17, 32

br i1 %exitcond18.not, label %for.inc50, label %for.cond4.preheader

for.inc50: ; preds = %for.inc47

%indvars.iv.next20 = add nuw nsw i64 %indvars.iv19, 1

%exitcond21.not = icmp eq i64 %indvars.iv.next20, 32

br i1 %exitcond21.not, label %for.end52, label %for.cond1.preheader

for.end52: ; preds = %for.inc50

ret void

}
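
The summary of this patch states that tensor contraction operands are logically represented as matrix multiplication operands. For the loop nest in the test above, this can be sketched as follows, assuming a small extent `N` for every dimension (the test uses 32 and a different array layout) and flat row-major storage; `idx4`, `contractDirect`, and `contractAsMatMul` are hypothetical helpers and not part of the patch.

```cpp
#include <vector>

constexpr int N = 3; // assumed small extent for every dimension

using Tensor4 = std::vector<double>; // N^4 elements, row-major

inline int idx4(int A, int B, int C, int D) {
  return ((A * N + B) * N + C) * N + D;
}

// Direct form: C[i][j][k][w] += A[i][l][j][q] * B[q][w][l][k].
void contractDirect(Tensor4 &C, const Tensor4 &A, const Tensor4 &B) {
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      for (int k = 0; k < N; ++k)
        for (int l = 0; l < N; ++l)
          for (int w = 0; w < N; ++w)
            for (int q = 0; q < N; ++q)
              C[idx4(i, j, k, w)] += A[idx4(i, l, j, q)] * B[idx4(q, w, l, k)];
}

// Same computation viewed as a matrix multiplication: the row index m stands
// for the bundle I = (i, j), the column index n for J = (k, w), and the
// contracted index p for P = (l, q).
void contractAsMatMul(Tensor4 &C, const Tensor4 &A, const Tensor4 &B) {
  for (int m = 0; m < N * N; ++m)
    for (int n = 0; n < N * N; ++n)
      for (int p = 0; p < N * N; ++p) {
        int i = m / N, j = m % N;
        int k = n / N, w = n % N;
        int l = p / N, q = p % N;
        C[idx4(i, j, k, w)] += A[idx4(i, l, j, q)] * B[idx4(q, w, l, k)];
      }
}
```

Both functions visit the same set of index tuples, so they compute the same result; the second form only makes the three-loop matrix multiplication structure explicit.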

polly/test/ScheduleOptimizer/pattern-matching-based-opts.ll

- ; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=false -debug < %s 2>&1 | FileCheck %s
- ; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true -debug < %s 2>&1 | FileCheck %s --check-prefix=PATTERN-MATCHING-OPTS
+ ; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=false \
+ ; RUN: -debug -polly-tc-opt < %s 2>&1 | FileCheck %s
+ ; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true -debug -polly-tc-opt < %s 2>&1 | FileCheck %s --check-prefix=PATTERN-MATCHING-OPTS
; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true -polly-ast-detect-parallel -polly-print-ast -disable-output < %s | FileCheck %s --check-prefix=PARALLEL-AST
; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true -stats -disable-output < %s 2>&1 | FileCheck %s --check-prefix=STATS -match-full-lines
; REQUIRES: asserts
;
; /* C := alpha*A*B + beta*C */
; for (i = 0; i < _PB_NI; i++)
; for (j = 0; j < _PB_NJ; j++)
; {
; C[i][j] *= beta;
; for (k = 0; k < _PB_NK; ++k)
; C[i][j] += alpha * A[i][k] * B[k][j];
; }
;
; CHECK-NOT: The matrix multiplication pattern was detected
+ ; CHECK-NOT: The tensor contraction pattern was detected
+ ; PATTERN-MATCHING-OPTS: The tensor contraction pattern was detected
; PATTERN-MATCHING-OPTS: The matrix multiplication pattern was detected
; PARALLEL-AST-NOT: #pragma known-parallel
; STATS: 1 polly-opt-isl - Number of matrix multiplication patterns detected and optimized
;
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-unknown"

define internal void @kernel_gemm(i32 %arg, i32 %arg1, i32 %arg2, double %arg3, double %arg4, [1056 x double]* %arg5, [1024 x double]* %arg6, [1056 x double]* %arg7) #0 {
...

polly/test/ScheduleOptimizer/pattern-matching-based-opts_11.ll

; RUN: opt %loadPolly -polly-import-jscop \
; RUN: -polly-import-jscop-postfix=transformed \
; RUN: -polly-pattern-matching-based-opts=true \
; RUN: -polly-target-throughput-vector-fma=1 \
; RUN: -polly-target-latency-vector-fma=8 \
; RUN: -polly-target-1st-cache-level-associativity=8 \
; RUN: -polly-target-2nd-cache-level-associativity=8 \
; RUN: -polly-target-1st-cache-level-size=32768 \
; RUN: -polly-target-vector-register-bitwidth=256 \
; RUN: -polly-target-2nd-cache-level-size=262144 \
- ; RUN: -polly-opt-isl -debug < %s 2>&1 \
+ ; RUN: -polly-opt-isl -debug \
+ ; RUN: -polly-tc-opt=true < %s 2>&1 \
; RUN: | FileCheck %s
; REQUIRES: asserts
;
; Check that the pattern matching detects the matrix multiplication pattern
; in case scalar memory accesses were replaced by accesses to newly created
; arrays.
;
+ ; CHECK: The tensor contraction pattern was detected
; CHECK: The matrix multiplication pattern was detected
;
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-unknown"

define void @kernel_gemm(i32 %ni, i32 %nj, i32 %nk, double %A, [1024 x double]* %B, [1024 x double]* %C) {
entry:
br label %entry.split
...

polly/test/ScheduleOptimizer/pattern-matching-based-opts_15.ll

; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
- ; RUN: -debug-only=polly-opt-isl -disable-output < %s 2>&1 | FileCheck %s
+ ; RUN: -debug-only=polly-opt-isl -disable-output \
+ ; RUN: -polly-tc-opt=true < %s 2>&1 | FileCheck %s
; REQUIRES: asserts
;
; for (i = 0; i < _PB_NI; i++)
; for (j = 0; j < _PB_NJ; j++)
; {
; for (k = 0; k < _PB_NK; k++)
; {
; double Mul = A[i][k] * B[k][j];
; D[i][j][k] += Mul;
; C[i][j] += Mul;
; }
; }
;
; CHECK-NOT: The matrix multiplication pattern was detected
+ ; CHECK-NOT: The tensor contraction pattern was detected

target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define void @kernel_gemm([1024 x double]* %C, [1024 x double]* %A, [1024 x double]* %B, [1024 x [1024 x double]]* %D) {
entry:
br label %for.cond1.preheader

...

polly/test/ScheduleOptimizer/pattern-matching-based-opts_16.ll

This file was added.

				; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
				; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
				; REQUIRES: asserts
				;
				; for (i = 0; i < 1024; i++)
				; for (j = 0; j < 1024; j++)
				; for (l = 0; l < 64; ++l)
				; for (w = 0; w < 64; ++w)
				; C[i][j] += A[i][l][w] * B[w][j][l];
				;
				; CHECK: The tensor contraction pattern was detected
				;
				target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-linux-gnu"

				define internal void @kernel_tc(i32 %ni, i32 %nj, i32 %nl, i32 %nq, i32 %nw, double %alpha, double %beta, [1024 x double]* %C, [64 x [64 x double]]* %A, [1024 x [64 x double]]* %B) {
				entry:
				br label %for.cond1.preheader

				for.cond1.preheader: ; preds = %for.inc30, %entry
				%indvars.iv43 = phi i64 [ 0, %entry ], [ %indvars.iv.next44, %for.inc30 ]
				br label %for.cond4.preheader

				for.cond4.preheader: ; preds = %for.inc27, %for.cond1.preheader
				%indvars.iv40 = phi i64 [ 0, %for.cond1.preheader ], [ %indvars.iv.next41, %for.inc27 ]
				br label %for.cond7.preheader

				for.cond7.preheader: ; preds = %for.inc24, %for.cond4.preheader
				%indvars.iv37 = phi i64 [ 0, %for.cond4.preheader ], [ %indvars.iv.next38, %for.inc24 ]
				br label %for.body9

				for.body9: ; preds = %for.body9, %for.cond7.preheader
				%indvars.iv = phi i64 [ 0, %for.cond7.preheader ], [ %indvars.iv.next, %for.body9 ]
				%arrayidx13 = getelementptr inbounds [64 x [64 x double]], [64 x [64 x double]]* %A, i64 %indvars.iv43, i64 %indvars.iv37, i64 %indvars.iv
				%i = load double, double* %arrayidx13, align 8
				%arrayidx19 = getelementptr inbounds [1024 x [64 x double]], [1024 x [64 x double]]* %B, i64 %indvars.iv, i64 %indvars.iv40, i64 %indvars.iv37
				%i1 = load double, double* %arrayidx19, align 8
				%mul = fmul fast double %i1, %i
				%arrayidx23 = getelementptr inbounds [1024 x double], [1024 x double]* %C, i64 %indvars.iv43, i64 %indvars.iv40
				%i2 = load double, double* %arrayidx23, align 8
				%add = fadd fast double %i2, %mul
				store double %add, double* %arrayidx23, align 8
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp ne i64 %indvars.iv.next, 64
				br i1 %exitcond, label %for.body9, label %for.inc24

				for.inc24: ; preds = %for.body9
				%indvars.iv.next38 = add nuw nsw i64 %indvars.iv37, 1
				%exitcond39 = icmp ne i64 %indvars.iv.next38, 64
				br i1 %exitcond39, label %for.cond7.preheader, label %for.inc27

				for.inc27: ; preds = %for.inc24
				%indvars.iv.next41 = add nuw nsw i64 %indvars.iv40, 1
				%exitcond42 = icmp ne i64 %indvars.iv.next41, 1024
				br i1 %exitcond42, label %for.cond4.preheader, label %for.inc30

				for.inc30: ; preds = %for.inc27
				%indvars.iv.next44 = add nuw nsw i64 %indvars.iv43, 1
				%exitcond45 = icmp ne i64 %indvars.iv.next44, 1024
				br i1 %exitcond45, label %for.cond1.preheader, label %for.end32

				for.end32: ; preds = %for.inc30
				ret void
				}
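
The summary says the patch logically represents tensor contraction operands as matrix multiplication operands. For the loop nest in this test, that view can be sketched in plain C as below. This is only an illustration, not code from the patch: the extents are shrunk to hypothetical small values, and the names `contract_direct`, `contract_as_gemm`, and `views_agree` are made up for the example. The contracted indices `l` and `w` are fused into a single index `k = l * NW + w`.

```c
#include <assert.h>

/* Hypothetical small extents; the test itself uses 1024, 1024, 64, 64. */
enum { NI = 4, NJ = 5, NL = 3, NW = 2, NK = NL * NW };

/* Direct form: the loop nest from the test's comment. */
void contract_direct(double C[NI][NJ], double A[NI][NL][NW],
                     double B[NW][NJ][NL]) {
  for (int i = 0; i < NI; i++)
    for (int j = 0; j < NJ; j++)
      for (int l = 0; l < NL; l++)
        for (int w = 0; w < NW; w++)
          C[i][j] += A[i][l][w] * B[w][j][l];
}

/* Matrix view: fuse (l, w) into k, copy the operands into 2-D buffers,
   then run an ordinary matrix-multiplication loop nest. */
void contract_as_gemm(double C[NI][NJ], double A[NI][NL][NW],
                      double B[NW][NJ][NL]) {
  double A2[NI][NK], B2[NK][NJ];
  for (int l = 0; l < NL; l++)
    for (int w = 0; w < NW; w++) {
      int k = l * NW + w;
      assert(k < NK); /* the fused index stays inside the buffers */
      for (int i = 0; i < NI; i++)
        A2[i][k] = A[i][l][w];
      for (int j = 0; j < NJ; j++)
        B2[k][j] = B[w][j][l];
    }
  for (int i = 0; i < NI; i++)
    for (int j = 0; j < NJ; j++)
      for (int k = 0; k < NK; k++)
        C[i][j] += A2[i][k] * B2[k][j];
}

/* Returns 1 when both forms give identical results on integer-valued data. */
int views_agree(void) {
  double A[NI][NL][NW], B[NW][NJ][NL];
  double C1[NI][NJ] = {{0}}, C2[NI][NJ] = {{0}};
  for (int i = 0; i < NI; i++)
    for (int l = 0; l < NL; l++)
      for (int w = 0; w < NW; w++)
        A[i][l][w] = i + 2 * l + 3 * w + 1;
  for (int w = 0; w < NW; w++)
    for (int j = 0; j < NJ; j++)
      for (int l = 0; l < NL; l++)
        B[w][j][l] = w - j + 2 * l;
  contract_direct(C1, A, B);
  contract_as_gemm(C2, A, B);
  for (int i = 0; i < NI; i++)
    for (int j = 0; j < NJ; j++)
      if (C1[i][j] != C2[i][j])
        return 0;
  return 1;
}
```

Calling `views_agree()` compares the two forms on the same inputs. The explicit copy into `A2` and `B2` is only there to make the matrix shape visible; the summary describes the representation as logical, so no claim is made here about how the patch itself handles the operands.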

polly/test/ScheduleOptimizer/pattern-matching-based-opts_17.ll

This file was added.

				; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
				; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
				; REQUIRES: asserts
				;
				; for (i = 0; i < 32; i++)
				; for (j = 0; j < 1024; j++)
				; for (k = 0; k < 32; ++k)
				; for (l = 0; l < 1024; ++l)
				; C[i][j][k] += A[i][k][l] * B[l][j];
				;
				; CHECK: The tensor contraction pattern was detected
				;
				target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-linux-gnu"

				define internal void @kernel_tc(i32 %ni, i32 %nj, i32 %nk, i32 %nl, double %alpha, double %beta, [1024 x [32 x double]]* %C, [32 x [1024 x double]]* %A, [1024 x double]* %B) {
				entry:
				br label %for.cond1.preheader

				for.cond1.preheader: ; preds = %for.inc30, %entry
				%indvars.iv43 = phi i64 [ 0, %entry ], [ %indvars.iv.next44, %for.inc30 ]
				br label %for.cond4.preheader

				for.cond4.preheader: ; preds = %for.inc27, %for.cond1.preheader
				%indvars.iv40 = phi i64 [ 0, %for.cond1.preheader ], [ %indvars.iv.next41, %for.inc27 ]
				br label %for.cond7.preheader

				for.cond7.preheader: ; preds = %for.inc24, %for.cond4.preheader
				%indvars.iv37 = phi i64 [ 0, %for.cond4.preheader ], [ %indvars.iv.next38, %for.inc24 ]
				br label %for.body9

				for.body9: ; preds = %for.body9, %for.cond7.preheader
				%indvars.iv = phi i64 [ 0, %for.cond7.preheader ], [ %indvars.iv.next, %for.body9 ]
				%arrayidx13 = getelementptr inbounds [32 x [1024 x double]], [32 x [1024 x double]]* %A, i64 %indvars.iv43, i64 %indvars.iv37, i64 %indvars.iv
				%i = load double, double* %arrayidx13, align 8
				%arrayidx17 = getelementptr inbounds [1024 x double], [1024 x double]* %B, i64 %indvars.iv, i64 %indvars.iv40
				%i1 = load double, double* %arrayidx17, align 8
				%mul = fmul fast double %i1, %i
				%arrayidx23 = getelementptr inbounds [1024 x [32 x double]], [1024 x [32 x double]]* %C, i64 %indvars.iv43, i64 %indvars.iv40, i64 %indvars.iv37
				%i2 = load double, double* %arrayidx23, align 8
				%add = fadd fast double %i2, %mul
				store double %add, double* %arrayidx23, align 8
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp ne i64 %indvars.iv.next, 1024
				br i1 %exitcond, label %for.body9, label %for.inc24

				for.inc24: ; preds = %for.body9
				%indvars.iv.next38 = add nuw nsw i64 %indvars.iv37, 1
				%exitcond39 = icmp ne i64 %indvars.iv.next38, 32
				br i1 %exitcond39, label %for.cond7.preheader, label %for.inc27

				for.inc27: ; preds = %for.inc24
				%indvars.iv.next41 = add nuw nsw i64 %indvars.iv40, 1
				%exitcond42 = icmp ne i64 %indvars.iv.next41, 1024
				br i1 %exitcond42, label %for.cond4.preheader, label %for.inc30

				for.inc30: ; preds = %for.inc27
				%indvars.iv.next44 = add nuw nsw i64 %indvars.iv43, 1
				%exitcond45 = icmp ne i64 %indvars.iv.next44, 32
				br i1 %exitcond45, label %for.cond1.preheader, label %for.end32

				for.end32: ; preds = %for.inc30
				ret void
				}

polly/test/ScheduleOptimizer/pattern-matching-based-opts_18.ll

This file was added.

				; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
				; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
				; REQUIRES: asserts
				;
				; for (i = 0; i < 32; i++)
				; for (j = 0; j < 32; j++)
				; for (k = 0; k < 32; ++k)
				; for (l = 0; l < 32; ++l)
				; for (w = 0; w < 32; ++w)
				; for (q = 0; q < 32; ++q)
				; C[i][j][k][w] += A[i][l][j][q] * B[q][w][l][k];
				;
				; CHECK: The tensor contraction pattern was detected
				;
				target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-linux-gnu"

				define internal void @kernel_tc(i32 %ni, i32 %nj, i32 %nk, i32 %nl, i32 %nq, i32 %nw, double %alpha, double %beta, [32 x [32 x [32 x double]]]* %C, [32 x [32 x [32 x double]]]* %A, [32 x [32 x [32 x double]]]* %B) {
				entry:
				br label %for.cond1.preheader

				for.cond1.preheader: ; preds = %for.inc50, %entry
				%indvars.iv71 = phi i64 [ 0, %entry ], [ %indvars.iv.next72, %for.inc50 ]
				br label %for.cond4.preheader

				for.cond4.preheader: ; preds = %for.inc47, %for.cond1.preheader
				%indvars.iv68 = phi i64 [ 0, %for.cond1.preheader ], [ %indvars.iv.next69, %for.inc47 ]
				br label %for.cond7.preheader

				for.cond7.preheader: ; preds = %for.inc44, %for.cond4.preheader
				%indvars.iv65 = phi i64 [ 0, %for.cond4.preheader ], [ %indvars.iv.next66, %for.inc44 ]
				br label %for.cond10.preheader

				for.cond10.preheader: ; preds = %for.inc41, %for.cond7.preheader
				%indvars.iv62 = phi i64 [ 0, %for.cond7.preheader ], [ %indvars.iv.next63, %for.inc41 ]
				br label %for.cond13.preheader

				for.cond13.preheader: ; preds = %for.inc38, %for.cond10.preheader
				%indvars.iv59 = phi i64 [ 0, %for.cond10.preheader ], [ %indvars.iv.next60, %for.inc38 ]
				br label %for.body15

				for.body15: ; preds = %for.body15, %for.cond13.preheader
				%indvars.iv = phi i64 [ 0, %for.cond13.preheader ], [ %indvars.iv.next, %for.body15 ]
				%arrayidx21 = getelementptr inbounds [32 x [32 x [32 x double]]], [32 x [32 x [32 x double]]]* %A, i64 %indvars.iv71, i64 %indvars.iv62, i64 %indvars.iv68, i64 %indvars.iv
				%i = load double, double* %arrayidx21, align 8
				%arrayidx29 = getelementptr inbounds [32 x [32 x [32 x double]]], [32 x [32 x [32 x double]]]* %B, i64 %indvars.iv, i64 %indvars.iv59, i64 %indvars.iv62, i64 %indvars.iv65
				%i1 = load double, double* %arrayidx29, align 8
				%mul = fmul fast double %i1, %i
				%arrayidx37 = getelementptr inbounds [32 x [32 x [32 x double]]], [32 x [32 x [32 x double]]]* %C, i64 %indvars.iv71, i64 %indvars.iv68, i64 %indvars.iv65, i64 %indvars.iv59
				%i2 = load double, double* %arrayidx37, align 8
				%add = fadd fast double %i2, %mul
				store double %add, double* %arrayidx37, align 8
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp ne i64 %indvars.iv.next, 32
				br i1 %exitcond, label %for.body15, label %for.inc38

				for.inc38: ; preds = %for.body15
				%indvars.iv.next60 = add nuw nsw i64 %indvars.iv59, 1
				%exitcond61 = icmp ne i64 %indvars.iv.next60, 32
				br i1 %exitcond61, label %for.cond13.preheader, label %for.inc41

				for.inc41: ; preds = %for.inc38
				%indvars.iv.next63 = add nuw nsw i64 %indvars.iv62, 1
				%exitcond64 = icmp ne i64 %indvars.iv.next63, 32
				br i1 %exitcond64, label %for.cond10.preheader, label %for.inc44

				for.inc44: ; preds = %for.inc41
				%indvars.iv.next66 = add nuw nsw i64 %indvars.iv65, 1
				%exitcond67 = icmp ne i64 %indvars.iv.next66, 32
				br i1 %exitcond67, label %for.cond7.preheader, label %for.inc47

				for.inc47: ; preds = %for.inc44
				%indvars.iv.next69 = add nuw nsw i64 %indvars.iv68, 1
				%exitcond70 = icmp ne i64 %indvars.iv.next69, 32
				br i1 %exitcond70, label %for.cond4.preheader, label %for.inc50

				for.inc50: ; preds = %for.inc47
				%indvars.iv.next72 = add nuw nsw i64 %indvars.iv71, 1
				%exitcond73 = icmp ne i64 %indvars.iv.next72, 32
				br i1 %exitcond73, label %for.cond1.preheader, label %for.end52

				for.end52: ; preds = %for.inc50
				ret void
				}

polly/test/ScheduleOptimizer/pattern-matching-based-opts_19.ll

This file was added.

				; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
				; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
				; REQUIRES: asserts
				;
				; for (i = 0; i < 8; i++)
				; for (j = 0; j < 8; j++)
				; for (k = 0; k < 4; ++k)
				; for (l = 0; l < 1024; ++l)
				; for (w = 0; w < 1024; ++w)
				; for (q = 0; q < 4; ++q)
				; C[i][j][k][w][q] += A[q][k][j][l][i] * B[l][w];
				;
				; CHECK: The tensor contraction pattern was detected
				;
				target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-linux-gnu"

				define internal void @kernel_tc([8 x [4 x [1024 x [4 x double]]]]* %C, [4 x [8 x [1024 x [8 x double]]]]* %A, [1024 x double]* %B) {
				entry:
				br label %for.cond1.preheader

				for.cond1.preheader: ; preds = %for.inc50, %entry
				%indvars.iv71 = phi i64 [ 0, %entry ], [ %indvars.iv.next72, %for.inc50 ]
				br label %for.cond4.preheader

				for.cond4.preheader: ; preds = %for.inc47, %for.cond1.preheader
				%indvars.iv68 = phi i64 [ 0, %for.cond1.preheader ], [ %indvars.iv.next69, %for.inc47 ]
				br label %for.cond7.preheader

				for.cond7.preheader: ; preds = %for.inc44, %for.cond4.preheader
				%indvars.iv65 = phi i64 [ 0, %for.cond4.preheader ], [ %indvars.iv.next66, %for.inc44 ]
				br label %for.cond10.preheader

				for.cond10.preheader: ; preds = %for.inc41, %for.cond7.preheader
				%indvars.iv62 = phi i64 [ 0, %for.cond7.preheader ], [ %indvars.iv.next63, %for.inc41 ]
				br label %for.cond13.preheader

				for.cond13.preheader: ; preds = %for.inc38, %for.cond10.preheader
				%indvars.iv59 = phi i64 [ 0, %for.cond10.preheader ], [ %indvars.iv.next60, %for.inc38 ]
				br label %for.body15

				for.body15: ; preds = %for.body15, %for.cond13.preheader
				%indvars.iv = phi i64 [ 0, %for.cond13.preheader ], [ %indvars.iv.next, %for.body15 ]
				%arrayidx23 = getelementptr inbounds [4 x [8 x [1024 x [8 x double]]]], [4 x [8 x [1024 x [8 x double]]]]* %A, i64 %indvars.iv, i64 %indvars.iv65, i64 %indvars.iv68, i64 %indvars.iv62, i64 %indvars.iv71
				%i = load double, double* %arrayidx23, align 8
				%arrayidx27 = getelementptr inbounds [1024 x double], [1024 x double]* %B, i64 %indvars.iv62, i64 %indvars.iv59
				%i1 = load double, double* %arrayidx27, align 8
				%mul = fmul fast double %i1, %i
				%arrayidx37 = getelementptr inbounds [8 x [4 x [1024 x [4 x double]]]], [8 x [4 x [1024 x [4 x double]]]]* %C, i64 %indvars.iv71, i64 %indvars.iv68, i64 %indvars.iv65, i64 %indvars.iv59, i64 %indvars.iv
				%i2 = load double, double* %arrayidx37, align 8
				%add = fadd fast double %i2, %mul
				store double %add, double* %arrayidx37, align 8
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp ne i64 %indvars.iv.next, 4
				br i1 %exitcond, label %for.body15, label %for.inc38

				for.inc38: ; preds = %for.body15
				%indvars.iv.next60 = add nuw nsw i64 %indvars.iv59, 1
				%exitcond61 = icmp ne i64 %indvars.iv.next60, 1024
				br i1 %exitcond61, label %for.cond13.preheader, label %for.inc41

				for.inc41: ; preds = %for.inc38
				%indvars.iv.next63 = add nuw nsw i64 %indvars.iv62, 1
				%exitcond64 = icmp ne i64 %indvars.iv.next63, 1024
				br i1 %exitcond64, label %for.cond10.preheader, label %for.inc44

				for.inc44: ; preds = %for.inc41
				%indvars.iv.next66 = add nuw nsw i64 %indvars.iv65, 1
				%exitcond67 = icmp ne i64 %indvars.iv.next66, 4
				br i1 %exitcond67, label %for.cond7.preheader, label %for.inc47

				for.inc47: ; preds = %for.inc44
				%indvars.iv.next69 = add nuw nsw i64 %indvars.iv68, 1
				%exitcond70 = icmp ne i64 %indvars.iv.next69, 8
				br i1 %exitcond70, label %for.cond4.preheader, label %for.inc50

				for.inc50: ; preds = %for.inc47
				%indvars.iv.next72 = add nuw nsw i64 %indvars.iv71, 1
				%exitcond73 = icmp ne i64 %indvars.iv.next72, 8
				br i1 %exitcond73, label %for.cond1.preheader, label %for.end52

				for.end52: ; preds = %for.inc50
				ret void
				}

polly/test/ScheduleOptimizer/pattern-matching-based-opts_2.ll

	; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true -debug < %s 2>&1 | FileCheck %s			; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
				; RUN: -polly-tc-opt=true -debug < %s 2>&1 | FileCheck %s
	; REQUIRES: asserts			; REQUIRES: asserts
	;			;
	; /* C := alpha*A*B + beta*C */			; /* C := alpha*A*B + beta*C */
	; for (i = 0; i < _PB_NI; i++)			; for (i = 0; i < _PB_NI; i++)
	; for (j = 0; j < _PB_NJ; j += 2)			; for (j = 0; j < _PB_NJ; j += 2)
	; {			; {
	; C[i][j] *= beta;			; C[i][j] *= beta;
	; for (k = 0; k < _PB_NK; ++k)			; for (k = 0; k < _PB_NK; ++k)
	; C[i][j] += alpha * A[i][k] * B[k][j];			; C[i][j] += alpha * A[i][k] * B[k][j];
	; }			; }
	;			;
	; Check that we won’t detect the matrix multiplication pattern,			; Check that we won’t detect the matrix multiplication pattern,
	; if, for example, there are memory accesses that have stride 2			; if, for example, there are memory accesses that have stride 2
	; after the interchanging of loops.			; after the interchanging of loops.
	;			;
	; CHECK-NOT: The matrix multiplication pattern was detected			; CHECK-NOT: The matrix multiplication pattern was detected
				; CHECK-NOT: The tensor contraction pattern was detected
	;			;
	target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"			target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
	target triple = "x86_64-unknown-unknown"			target triple = "x86_64-unknown-unknown"

	define internal void @kernel_gemm(i32 %arg, i32 %arg1, i32 %arg2, double %arg3, double %arg4, [1056 x double]* %arg5, [1024 x double]* %arg6, [1056 x double]* %arg7) #0 {			define internal void @kernel_gemm(i32 %arg, i32 %arg1, i32 %arg2, double %arg3, double %arg4, [1056 x double]* %arg5, [1024 x double]* %arg6, [1056 x double]* %arg7) #0 {
	bb:			bb:
	br label %bb8			br label %bb8

polly/test/ScheduleOptimizer/pattern-matching-based-opts_20.ll

This file was added.

				; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
				; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
				; REQUIRES: asserts
				;
				; for (i = 0; i < 16; i++)
				; for (j = 0; j < 16; j++)
				; for (k = 0; k < 8; ++k)
				; for (l = 0; l < 1024; ++l)
				; for (w = 0; w < 8; ++w)
				; for (q = 0; q < 8; ++q)
				; for (x = 0; x < 8; ++x)
				; C[i][j][k][w][q][x] += A[l][x][j][k] * B[w][q][l][i];
				;
				; CHECK: The tensor contraction pattern was detected
				;
				target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-linux-gnu"

				define internal void @kernel_tc([16 x [8 x [8 x [8 x [8 x double]]]]]* %C, [8 x [16 x [8 x double]]]* %A, [8 x [1024 x [16 x double]]]* %B) {
				entry:
				br label %for.cond1.preheader

				for.cond1.preheader: ; preds = %for.inc60, %entry
				%indvars.iv85 = phi i64 [ 0, %entry ], [ %indvars.iv.next86, %for.inc60 ]
				br label %for.cond4.preheader

				for.cond4.preheader: ; preds = %for.inc57, %for.cond1.preheader
				%indvars.iv82 = phi i64 [ 0, %for.cond1.preheader ], [ %indvars.iv.next83, %for.inc57 ]
				br label %for.cond7.preheader

				for.cond7.preheader: ; preds = %for.inc54, %for.cond4.preheader
				%indvars.iv79 = phi i64 [ 0, %for.cond4.preheader ], [ %indvars.iv.next80, %for.inc54 ]
				br label %for.cond10.preheader

				for.cond10.preheader: ; preds = %for.inc51, %for.cond7.preheader
				%indvars.iv76 = phi i64 [ 0, %for.cond7.preheader ], [ %indvars.iv.next77, %for.inc51 ]
				br label %for.cond13.preheader

				for.cond13.preheader: ; preds = %for.inc48, %for.cond10.preheader
				%indvars.iv73 = phi i64 [ 0, %for.cond10.preheader ], [ %indvars.iv.next74, %for.inc48 ]
				br label %for.cond16.preheader

				for.cond16.preheader: ; preds = %for.inc45, %for.cond13.preheader
				%indvars.iv70 = phi i64 [ 0, %for.cond13.preheader ], [ %indvars.iv.next71, %for.inc45 ]
				br label %for.body18

				for.body18: ; preds = %for.body18, %for.cond16.preheader
				%indvars.iv = phi i64 [ 0, %for.cond16.preheader ], [ %indvars.iv.next, %for.body18 ]
				%arrayidx24 = getelementptr inbounds [8 x [16 x [8 x double]]], [8 x [16 x [8 x double]]]* %A, i64 %indvars.iv76, i64 %indvars.iv, i64 %indvars.iv82, i64 %indvars.iv79
				%i = load double, double* %arrayidx24, align 8
				%arrayidx32 = getelementptr inbounds [8 x [1024 x [16 x double]]], [8 x [1024 x [16 x double]]]* %B, i64 %indvars.iv73, i64 %indvars.iv70, i64 %indvars.iv76, i64 %indvars.iv85
				%i1 = load double, double* %arrayidx32, align 8
				%mul = fmul fast double %i1, %i
				%arrayidx44 = getelementptr inbounds [16 x [8 x [8 x [8 x [8 x double]]]]], [16 x [8 x [8 x [8 x [8 x double]]]]]* %C, i64 %indvars.iv85, i64 %indvars.iv82, i64 %indvars.iv79, i64 %indvars.iv73, i64 %indvars.iv70, i64 %indvars.iv
				%i2 = load double, double* %arrayidx44, align 8
				%add = fadd fast double %i2, %mul
				store double %add, double* %arrayidx44, align 8
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp ne i64 %indvars.iv.next, 8
				br i1 %exitcond, label %for.body18, label %for.inc45

				for.inc45: ; preds = %for.body18
				%indvars.iv.next71 = add nuw nsw i64 %indvars.iv70, 1
				%exitcond72 = icmp ne i64 %indvars.iv.next71, 8
				br i1 %exitcond72, label %for.cond16.preheader, label %for.inc48

				for.inc48: ; preds = %for.inc45
				%indvars.iv.next74 = add nuw nsw i64 %indvars.iv73, 1
				%exitcond75 = icmp ne i64 %indvars.iv.next74, 8
				br i1 %exitcond75, label %for.cond13.preheader, label %for.inc51

				for.inc51: ; preds = %for.inc48
				%indvars.iv.next77 = add nuw nsw i64 %indvars.iv76, 1
				%exitcond78 = icmp ne i64 %indvars.iv.next77, 1024
				br i1 %exitcond78, label %for.cond10.preheader, label %for.inc54

				for.inc54: ; preds = %for.inc51
				%indvars.iv.next80 = add nuw nsw i64 %indvars.iv79, 1
				%exitcond81 = icmp ne i64 %indvars.iv.next80, 8
				br i1 %exitcond81, label %for.cond7.preheader, label %for.inc57

				for.inc57: ; preds = %for.inc54
				%indvars.iv.next83 = add nuw nsw i64 %indvars.iv82, 1
				%exitcond84 = icmp ne i64 %indvars.iv.next83, 16
				br i1 %exitcond84, label %for.cond4.preheader, label %for.inc60

				for.inc60: ; preds = %for.inc57
				%indvars.iv.next86 = add nuw nsw i64 %indvars.iv85, 1
				%exitcond87 = icmp ne i64 %indvars.iv.next86, 16
				br i1 %exitcond87, label %for.cond1.preheader, label %for.end62

				for.end62: ; preds = %for.inc60
				ret void
				}

polly/test/ScheduleOptimizer/pattern-matching-based-opts_21.ll

This file was added.

				; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
				; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
				; REQUIRES: asserts
				;
				; for (int i = 0; i < 32; i++)
				; for (int j = 0; j < 32; j++)
				; for (int l = 0; l < 32; l++)
				; for (int w = 0; w < 32; w++)
				; C[i][j] += A[i][l][w] * B[w][j][i];
				;
				; CHECK-NOT: The tensor contraction pattern was detected
				;
				target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-linux-gnu"

				define void @foo([64 x double]* noundef %C, [64 x [64 x double]]* noundef %A, [64 x [64 x double]]* noundef %B) {
				entry:
				br label %for.cond1.preheader

				for.cond1.preheader: ; preds = %for.inc33, %entry
				%indvars.iv49 = phi i64 [ 0, %entry ], [ %indvars.iv.next50, %for.inc33 ]
				br label %for.cond5.preheader

				for.cond5.preheader: ; preds = %for.inc30, %for.cond1.preheader
				%indvars.iv45 = phi i64 [ 0, %for.cond1.preheader ], [ %indvars.iv.next46, %for.inc30 ]
				br label %for.cond9.preheader

				for.cond9.preheader: ; preds = %for.inc27, %for.cond5.preheader
				%indvars.iv41 = phi i64 [ 0, %for.cond5.preheader ], [ %indvars.iv.next42, %for.inc27 ]
				br label %for.body12

				for.body12: ; preds = %for.body12, %for.cond9.preheader
				%indvars.iv = phi i64 [ 0, %for.cond9.preheader ], [ %indvars.iv.next, %for.body12 ]
				%arrayidx16 = getelementptr inbounds [64 x [64 x double]], [64 x [64 x double]]* %A, i64 %indvars.iv49, i64 %indvars.iv41, i64 %indvars.iv
				%i = load double, double* %arrayidx16, align 8
				%arrayidx22 = getelementptr inbounds [64 x [64 x double]], [64 x [64 x double]]* %B, i64 %indvars.iv, i64 %indvars.iv45, i64 %indvars.iv49
				%i1 = load double, double* %arrayidx22, align 8
				%mul = fmul fast double %i1, %i
				%arrayidx26 = getelementptr inbounds [64 x double], [64 x double]* %C, i64 %indvars.iv49, i64 %indvars.iv45
				%i2 = load double, double* %arrayidx26, align 8
				%add = fadd fast double %i2, %mul
				store double %add, double* %arrayidx26, align 8
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp ne i64 %indvars.iv.next, 32
				br i1 %exitcond, label %for.body12, label %for.inc27

				for.inc27: ; preds = %for.body12
				%indvars.iv.next42 = add nuw nsw i64 %indvars.iv41, 1
				%exitcond44 = icmp ne i64 %indvars.iv.next42, 32
				br i1 %exitcond44, label %for.cond9.preheader, label %for.inc30

				for.inc30: ; preds = %for.inc27
				%indvars.iv.next46 = add nuw nsw i64 %indvars.iv45, 1
				%exitcond48 = icmp ne i64 %indvars.iv.next46, 32
				br i1 %exitcond48, label %for.cond5.preheader, label %for.inc33

				for.inc33: ; preds = %for.inc30
				%indvars.iv.next50 = add nuw nsw i64 %indvars.iv49, 1
				%exitcond52 = icmp ne i64 %indvars.iv.next50, 32
				br i1 %exitcond52, label %for.cond1.preheader, label %for.end35

				for.end35: ; preds = %for.inc33
				ret void
				}
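
One way to read the difference between the detected and the rejected loop nests is to count how many of the three tensors use each loop index. In a contraction of two operands, a free index appears in `C` and one operand, and a contracted index appears in both operands but not in `C`, so each index is used exactly twice. In the nest above, `i` appears in `C`, `A`, and `B`. The C sketch below checks only this necessary condition; it is an assumption-laden simplification, not the detection algorithm of the patch, and the names `IDX_*` and `looks_like_contraction` are hypothetical.

```c
#include <assert.h>

/* One bit per loop index (hypothetical encoding for this sketch). */
enum { IDX_I = 1, IDX_J = 2, IDX_K = 4, IDX_L = 8, IDX_W = 16 };

/* c, a, b: sets of loop indices used in the subscripts of C, A, and B.
   all: set of all loop indices of the nest.
   Returns 1 if every loop index is used by exactly two of the three
   tensors, 0 otherwise. This is a necessary condition only. */
int looks_like_contraction(unsigned c, unsigned a, unsigned b, unsigned all) {
  /* The tensors must not use indices outside the loop nest. */
  assert(((c | a | b) & ~all) == 0);
  for (unsigned bit = 1; bit != 0 && bit <= all; bit <<= 1) {
    if (!(all & bit))
      continue;
    int uses = !!(c & bit) + !!(a & bit) + !!(b & bit);
    if (uses != 2)
      return 0;
  }
  return 1;
}
```

For `C[i][j] += A[i][l][w] * B[w][j][l]` the call `looks_like_contraction(IDX_I | IDX_J, IDX_I | IDX_L | IDX_W, IDX_W | IDX_J | IDX_L, IDX_I | IDX_J | IDX_L | IDX_W)` returns 1, while replacing the last subscript of `B` with `i` makes it return 0. The sketch looks only at which indices occur, so it says nothing about subscript offsets, access order, or dependences.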

polly/test/ScheduleOptimizer/pattern-matching-based-opts_22.ll

This file was added.

				; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
				; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
				; REQUIRES: asserts
				;
				; for (int i = 0; i < 32; i++)
				; for (int j = 0; j < 32; j++)
				; for (int l = 0; l < 32; l++)
				; for (int w = 0; w < 32; w++)
				; C[i][j] += A[i][l][w] * B[w][j][i+3];
				;
				; CHECK-NOT: The tensor contraction pattern was detected
				;
				target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-linux-gnu"

				define void @foo([64 x double]* noundef %C, [64 x [64 x double]]* noundef %A, [64 x [64 x double]]* noundef %B) {
				entry:
				br label %for.cond1.preheader

				for.cond1.preheader: ; preds = %for.inc34, %entry
				%indvars.iv50 = phi i64 [ 0, %entry ], [ %indvars.iv.next51, %for.inc34 ]
				br label %for.cond5.preheader

				for.cond5.preheader: ; preds = %for.inc31, %for.cond1.preheader
				%indvars.iv46 = phi i64 [ 0, %for.cond1.preheader ], [ %indvars.iv.next47, %for.inc31 ]
				br label %for.cond9.preheader

				for.cond9.preheader: ; preds = %for.inc28, %for.cond5.preheader
				%indvars.iv42 = phi i64 [ 0, %for.cond5.preheader ], [ %indvars.iv.next43, %for.inc28 ]
				br label %for.body12

				for.body12: ; preds = %for.body12, %for.cond9.preheader
				%indvars.iv = phi i64 [ 0, %for.cond9.preheader ], [ %indvars.iv.next, %for.body12 ]
				%arrayidx16 = getelementptr inbounds [64 x [64 x double]], [64 x [64 x double]]* %A, i64 %indvars.iv50, i64 %indvars.iv42, i64 %indvars.iv
				%i = load double, double* %arrayidx16, align 8
				%i1 = add nuw nsw i64 %indvars.iv50, 3
				%arrayidx22 = getelementptr inbounds [64 x [64 x double]], [64 x [64 x double]]* %B, i64 %indvars.iv, i64 %indvars.iv46, i64 %i1
				%i2 = load double, double* %arrayidx22, align 8
				%mul = fmul fast double %i2, %i
				%arrayidx26 = getelementptr inbounds [64 x double], [64 x double]* %C, i64 %indvars.iv50, i64 %indvars.iv46
				%i3 = load double, double* %arrayidx26, align 8
				%add27 = fadd fast double %i3, %mul
				store double %add27, double* %arrayidx26, align 8
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp ne i64 %indvars.iv.next, 32
				br i1 %exitcond, label %for.body12, label %for.inc28

				for.inc28: ; preds = %for.body12
				%indvars.iv.next43 = add nuw nsw i64 %indvars.iv42, 1
				%exitcond45 = icmp ne i64 %indvars.iv.next43, 32
				br i1 %exitcond45, label %for.cond9.preheader, label %for.inc31

				for.inc31: ; preds = %for.inc28
				%indvars.iv.next47 = add nuw nsw i64 %indvars.iv46, 1
				%exitcond49 = icmp ne i64 %indvars.iv.next47, 32
				br i1 %exitcond49, label %for.cond5.preheader, label %for.inc34

				for.inc34: ; preds = %for.inc31
				%indvars.iv.next51 = add nuw nsw i64 %indvars.iv50, 1
				%exitcond54 = icmp ne i64 %indvars.iv.next51, 32
				br i1 %exitcond54, label %for.cond1.preheader, label %for.end36

				for.end36: ; preds = %for.inc34
				ret void
				}

polly/test/ScheduleOptimizer/pattern-matching-based-opts_23.ll

This file was added.

				; RUN: opt %loadPolly -polly-delicm -polly-simplify -polly-opt-isl \
				; RUN: -polly-pattern-matching-based-opts=true \
				; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
				; REQUIRES: asserts
				;
				; Check that a region statement, which has the correct order of accesses, is not
				; detected.
				;
				; for (int i = 0; i < 32; i++)
				; for (int j = 0; j < 32; j++)
				; for (int k = 0; k < 32; k++) {
				; int c = C[i][j];
				; if (i*j*k < 10) {
				; C[i][j] = A[i][k] + B[k][j];
				; } else {
				; C[i][j] = c;
				; }
				; }
				;
				; CHECK-NOT: The tensor contraction pattern was detected
				;
				target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-linux-gnu"

				; Function Attrs: nounwind uwtable
				define dso_local void @foo([64 x double]* noundef %C, [64 x double]* noundef %A, [64 x double]* noundef %B) #0 {
				entry:
				br label %for.cond1.preheader

				for.cond1.preheader: ; preds = %entry, %for.inc34
				%indvars.iv48 = phi i64 [ 0, %entry ], [ %indvars.iv.next49, %for.inc34 ]
				br label %for.cond5.preheader

				for.cond5.preheader: ; preds = %for.cond1.preheader, %for.inc31
				%indvars.iv43 = phi i64 [ 0, %for.cond1.preheader ], [ %indvars.iv.next44, %for.inc31 ]
				br label %for.body8

				for.body8: ; preds = %for.cond5.preheader, %if.end
				%indvars.iv = phi i64 [ 0, %for.cond5.preheader ], [ %indvars.iv.next, %if.end ]
				%arrayidx10 = getelementptr inbounds [64 x double], [64 x double]* %C, i64 %indvars.iv48, i64 %indvars.iv43
				%0 = mul nuw nsw i64 %indvars.iv43, %indvars.iv48
				%1 = mul nuw nsw i64 %0, %indvars.iv
				%cmp12 = icmp ult i64 %1, 10
				br i1 %cmp12, label %if.then, label %if.else

				if.then: ; preds = %for.body8
				%arrayidx17 = getelementptr inbounds [64 x double], [64 x double]* %A, i64 %indvars.iv48, i64 %indvars.iv
				%2 = load double, double* %arrayidx17, align 8
				%arrayidx21 = getelementptr inbounds [64 x double], [64 x double]* %B, i64 %indvars.iv, i64 %indvars.iv43
				%3 = load double, double* %arrayidx21, align 8
				%add = fadd fast double %3, %2
				br label %if.end

				if.else: ; preds = %for.body8
				%4 = load double, double* %arrayidx10, align 8
				%conv = fptosi double %4 to i32
				%conv26 = sitofp i32 %conv to double
				br label %if.end

				if.end: ; preds = %if.else, %if.then
				%storemerge = phi double [ %conv26, %if.else ], [ %add, %if.then ]
				store double %storemerge, double* %arrayidx10, align 8
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp ne i64 %indvars.iv.next, 32
				br i1 %exitcond, label %for.body8, label %for.inc31

				for.inc31: ; preds = %if.end
				%indvars.iv.next44 = add nuw nsw i64 %indvars.iv43, 1
				%exitcond47 = icmp ne i64 %indvars.iv.next44, 32
				br i1 %exitcond47, label %for.cond5.preheader, label %for.inc34

				for.inc34: ; preds = %for.inc31
				%indvars.iv.next49 = add nuw nsw i64 %indvars.iv48, 1
				%exitcond51 = icmp ne i64 %indvars.iv.next49, 32
				br i1 %exitcond51, label %for.cond1.preheader, label %for.end36

				for.end36: ; preds = %for.inc34
				ret void
				}

polly/test/ScheduleOptimizer/pattern-matching-based-opts_24.ll

This file was added.

				; RUN: opt %loadPolly -polly-reschedule=0 -polly-opt-isl \
				; RUN: -polly-pattern-matching-based-opts=true -polly-tc-opt=true \
				; RUN: -debug -disable-output < %s 2>&1 | FileCheck %s
				; REQUIRES: asserts
				;
				; for (i = 0; i < 1024; i++)
				; for (j = 0; j < 1024; j++)
				; for (l = 0; l < 64; ++l)
				; for (w = 0; w < 64; ++w)
				; C[i][j] += A[i][l][w] * B[w][j][l];
				;
				; CHECK: The tensor contraction pattern was detected
				;
				target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-linux-gnu"

				define internal void @kernel_tc(i32 %ni, i32 %nj, i32 %nl, i32 %nq, i32 %nw, double %alpha, double %beta, [1024 x double]* %C, [64 x [64 x double]]* %A, [1024 x [64 x double]]* %B) {
				entry:
				br label %for.cond1.preheader

				for.cond1.preheader: ; preds = %for.inc30, %entry
				%indvars.iv43 = phi i64 [ 0, %entry ], [ %indvars.iv.next44, %for.inc30 ]
				br label %for.cond4.preheader

				for.cond4.preheader: ; preds = %for.inc27, %for.cond1.preheader
				%indvars.iv40 = phi i64 [ 0, %for.cond1.preheader ], [ %indvars.iv.next41, %for.inc27 ]
				br label %for.cond7.preheader

				for.cond7.preheader: ; preds = %for.inc24, %for.cond4.preheader
				%indvars.iv37 = phi i64 [ 0, %for.cond4.preheader ], [ %indvars.iv.next38, %for.inc24 ]
				br label %for.body9

				for.body9: ; preds = %for.body9, %for.cond7.preheader
				%indvars.iv = phi i64 [ 0, %for.cond7.preheader ], [ %indvars.iv.next, %for.body9 ]
				%arrayidx13 = getelementptr inbounds [64 x [64 x double]], [64 x [64 x double]]* %A, i64 %indvars.iv43, i64 %indvars.iv37, i64 %indvars.iv
				%i = load double, double* %arrayidx13, align 8
				%arrayidx19 = getelementptr inbounds [1024 x [64 x double]], [1024 x [64 x double]]* %B, i64 %indvars.iv, i64 %indvars.iv40, i64 %indvars.iv37
				%i1 = load double, double* %arrayidx19, align 8
				%mul = fmul fast double %i1, %i
				%arrayidx23 = getelementptr inbounds [1024 x double], [1024 x double]* %C, i64 %indvars.iv43, i64 %indvars.iv40
				%i2 = load double, double* %arrayidx23, align 8
				%add = fadd fast double %i2, %mul
				store double %add, double* %arrayidx23, align 8
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp ne i64 %indvars.iv.next, 64
				br i1 %exitcond, label %for.body9, label %for.inc24

				for.inc24: ; preds = %for.body9
				%indvars.iv.next38 = add nuw nsw i64 %indvars.iv37, 1
				%exitcond39 = icmp ne i64 %indvars.iv.next38, 64
				br i1 %exitcond39, label %for.cond7.preheader, label %for.inc27

				for.inc27: ; preds = %for.inc24
				%indvars.iv.next41 = add nuw nsw i64 %indvars.iv40, 1
				%exitcond42 = icmp ne i64 %indvars.iv.next41, 1024
				br i1 %exitcond42, label %for.cond4.preheader, label %for.inc30

				for.inc30: ; preds = %for.inc27
				%indvars.iv.next44 = add nuw nsw i64 %indvars.iv43, 1
				%exitcond45 = icmp ne i64 %indvars.iv.next44, 1024
				br i1 %exitcond45, label %for.cond1.preheader, label %for.end32

				for.end32: ; preds = %for.inc30
				ret void
				}

polly/test/ScheduleOptimizer/pattern-matching-based-opts_25.ll

This file was added.

				; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
				; RUN: -polly-tc-opt=true -debug -disable-output < %s 2>&1 | FileCheck %s
				; REQUIRES: asserts
				;
				; for (int i = 0; i < 32; i++)
				; for (int j = 0; j < 32; j++)
				; for (int w = 0; w < 32; w++)
				; C[i][j] += A[i][w] * B[w][j][i];
				;
				; CHECK-NOT: The tensor contraction pattern was detected
				;
				target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"
				target triple = "arm64-apple-macosx12.0.0"

				define void @foo([64 x double]* noundef %C, [64 x double]* noundef %A, [64 x [64 x double]]* noundef %B) {
				entry:
				br label %for.cond1.preheader

				for.cond1.preheader: ; preds = %for.cond.cleanup3, %entry
				%indvars.iv45 = phi i64 [ 0, %entry ], [ %indvars.iv.next46, %for.cond.cleanup3 ]
				br label %for.cond5.preheader

				for.cond.cleanup: ; preds = %for.cond.cleanup3
				ret void

				for.cond5.preheader: ; preds = %for.cond.cleanup7, %for.cond1.preheader
				%indvars.iv41 = phi i64 [ 0, %for.cond1.preheader ], [ %indvars.iv.next42, %for.cond.cleanup7 ]
				%arrayidx20 = getelementptr inbounds [64 x double], [64 x double]* %C, i64 %indvars.iv45, i64 %indvars.iv41
				%.pre = load double, double* %arrayidx20, align 8
				br label %for.body8

				for.cond.cleanup3: ; preds = %for.cond.cleanup7
				%indvars.iv.next46 = add nuw nsw i64 %indvars.iv45, 1
				%exitcond48.not = icmp eq i64 %indvars.iv.next46, 32
				br i1 %exitcond48.not, label %for.cond.cleanup, label %for.cond1.preheader

				for.cond.cleanup7: ; preds = %for.body8
				%indvars.iv.next42 = add nuw nsw i64 %indvars.iv41, 1
				%exitcond44.not = icmp eq i64 %indvars.iv.next42, 32
				br i1 %exitcond44.not, label %for.cond.cleanup3, label %for.cond5.preheader

				for.body8: ; preds = %for.body8, %for.cond5.preheader
				%i = phi double [ %.pre, %for.cond5.preheader ], [ %i3, %for.body8 ]
				%indvars.iv = phi i64 [ 0, %for.cond5.preheader ], [ %indvars.iv.next, %for.body8 ]
				%arrayidx10 = getelementptr inbounds [64 x double], [64 x double]* %A, i64 %indvars.iv45, i64 %indvars.iv
				%i1 = load double, double* %arrayidx10, align 8
				%arrayidx16 = getelementptr inbounds [64 x [64 x double]], [64 x [64 x double]]* %B, i64 %indvars.iv, i64 %indvars.iv41, i64 %indvars.iv45
				%i2 = load double, double* %arrayidx16, align 8
				%i3 = tail call double @llvm.fmuladd.f64(double %i1, double %i2, double %i)
				store double %i3, double* %arrayidx20, align 8
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond.not = icmp eq i64 %indvars.iv.next, 32
				br i1 %exitcond.not, label %for.cond.cleanup7, label %for.body8
				}

				declare double @llvm.fmuladd.f64(double, double, double)
				No newline at end of file
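The test above expects that no tensor contraction is reported for `C[i][j] += A[i][w] * B[w][j][i]`. One plausible reading is that the index `i` occurs in all three accesses, whereas in the usual definition of a tensor contraction every index occurs in exactly two of them (free indices in `C` and one operand, contracted indices in both operands only). The following is a minimal sketch of that necessary condition only. It is not Polly's detection code, which works on access relations and dependences; the helper names `uses` and `looks_like_contraction` and the one-character-per-subscript encoding are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if the loop index `idx` occurs in the
 * subscript string `subs` (one character per subscript). */
static bool uses(const char *subs, char idx) {
    return strchr(subs, idx) != NULL;
}

/* Simplified necessary condition, not Polly's implementation:
 * every loop index must occur in exactly two of the three accesses
 * C[c], A[a] and B[b]. */
bool looks_like_contraction(const char *loops, const char *c,
                            const char *a, const char *b) {
    for (const char *p = loops; *p != '\0'; ++p) {
        int count = uses(c, *p) + uses(a, *p) + uses(b, *p);
        if (count != 2)
            return false;
    }
    return true;
}
```

With this encoding, the loop nest of this test would be written as `looks_like_contraction("ijw", "ij", "iw", "wji")`, which fails because `i` is counted three times, while the GEMM form `("ikj", "ij", "ik", "kj")` passes.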

polly/test/ScheduleOptimizer/pattern-matching-based-opts_4.ll

	  ; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
	- ; RUN: -debug < %s 2>&1 | FileCheck %s
	+ ; RUN: -debug -polly-tc-opt=true -disable-output < %s 2>&1 | FileCheck %s
	- ; RUN: opt %loadPolly -polly-pattern-matching-based-opts=true \
	+ ; RUN: opt %loadPolly -polly-opt-isl -polly-pattern-matching-based-opts=true \
	  ; RUN: -polly-target-throughput-vector-fma=1 \
	  ; RUN: -polly-target-latency-vector-fma=8 \
	  ; RUN: -polly-target-1st-cache-level-size=32768 \
	  ; RUN: -polly-target-vector-register-bitwidth=256 \
	- ; RUN: -polly-target-2nd-cache-level-size=262144 \
	- ; RUN: -polly-opt-isl -polly-print-ast -disable-output < %s | FileCheck %s --check-prefix=PATTERN-MATCHING-OPTS
	+ ; RUN: -polly-target-2nd-cache-level-size=262144 -polly-print-ast \
	+ ; RUN: -polly-tc-opt=true -disable-output -polly-opt-isl < %s | \
	+ ; RUN: FileCheck %s --check-prefix=PATTERN-MATCHING-OPTS
	  ; REQUIRES: asserts
	  ;
	  ; C := A * B + C
	  ; Check that the pattern matching optimizations can detect different
	  ; permutations of GEMM loop and produce the correct ISL AST. In this case,
	  ; dimensions of band nodes can be implicitly permuted by the algorithm
	  ; applied during the schedule generation. It should be taken into the
	  ; account during the pattern matching optimizations.
	  ; for (i = 0; i < _PB_NI; i++)
	  ; for (k = 0; k < _PB_NK; ++k)
	  ; for (j = 0; j < _PB_NJ; j++)
	  ; C[i][j] += A[i][k] * B[k][j];
	  ;
	+ ; CHECK: The tensor contraction pattern was detected
	  ; CHECK: The matrix multiplication pattern was detected
	  ;
	  ; PATTERN-MATCHING-OPTS: // 1st level tiling - Tiles
	  ; PATTERN-MATCHING-OPTS-NEXT: for (int c1 = 0; c1 <= 3; c1 += 1) {
	  ; PATTERN-MATCHING-OPTS-NEXT: for (int c3 = 256 * c1; c3 <= 256 * c1 + 255; c3 += 1)
	  ; PATTERN-MATCHING-OPTS-NEXT: for (int c4 = 0; c4 <= 1023; c4 += 1)
	  ; PATTERN-MATCHING-OPTS-NEXT: CopyStmt_0(0, c3, c4);
	  ; PATTERN-MATCHING-OPTS-NEXT: for (int c2 = 0; c2 <= 10; c2 += 1) {
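The comment in this test refers to "different permutations of GEMM loop". As a rough illustration of why the detection can treat such permutations alike, the sketch below compares the i-k-j order used in this test with the more familiar i-j-k order. The size `N`, the function names `gemm_ijk` and `gemm_ikj`, and the fixed-size array signatures are assumptions chosen for the example; they do not come from the patch.

```c
#include <stddef.h>

enum { N = 4 }; /* illustrative size, not taken from the test */

/* Conventional order: the reduction loop k is innermost. */
void gemm_ijk(double C[N][N], double A[N][N], double B[N][N]) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            for (size_t k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* Order used in the test comment: k is the middle loop, j is innermost. */
void gemm_ikj(double C[N][N], double A[N][N], double B[N][N]) {
    for (size_t i = 0; i < N; i++)
        for (size_t k = 0; k < N; k++)
            for (size_t j = 0; j < N; j++)
                C[i][j] += A[i][k] * B[k][j];
}
```

In both variants each element `C[i][j]` receives its `N` updates in increasing order of `k`, so the two functions should produce identical results; only the order in which different elements of `C` are visited changes.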


Revision Contents

Diff 450615
