
[GSoC 2016] [Polly] [WIP] Perform replacement of access relations and creation of new arrays according to the packing transformation
ClosedPublic

Authored by gareevroman on Jul 9 2016, 6:58 AM.

Details

Summary

This is the third patch to apply the BLIS matmul optimization pattern on matmul kernels (http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf). BLIS implements gemm as three nested loops around a macro-kernel, plus two packing routines. The macro-kernel is implemented in terms of two additional loops around a micro-kernel. The micro-kernel is a loop around a rank-1 (i.e., outer product) update. In this change we replace the access relations and create empty arrays, which are steps towards implementing the packing transformation. In subsequent changes we will implement copying to the created arrays.
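To make the structure concrete, here is a minimal C++ sketch of the micro-kernel as a loop around rank-1 (outer product) updates. The tile sizes Mr, Nr, Kc, the function name, and the layout of Ac/Bc are illustrative placeholders, not the values the patch computes:

```cpp
#include <cassert>

// Illustrative register-tile sizes; real values are target-dependent.
constexpr int Mr = 2, Nr = 2, Kc = 4;

// Micro-kernel: each iteration of pr performs one rank-1 (outer product)
// update of the Mr x Nr tile of C, as in the BLIS design. Ac and Bc hold
// packed micro-panels, stored so that column pr of each is contiguous.
void microKernel(const double (&Ac)[Kc][Mr], const double (&Bc)[Kc][Nr],
                 double (&C)[Mr][Nr]) {
  for (int pr = 0; pr < Kc; ++pr)     // loop around the rank-1 update
    for (int ir = 0; ir < Mr; ++ir)
      for (int jr = 0; jr < Nr; ++jr)
        C[ir][jr] += Ac[pr][ir] * Bc[pr][jr];
}
```

In the full BLIS structure, this kernel sits inside two loops forming the macro-kernel, which in turn sits inside the three outermost blocking loops.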

Diff Detail

Repository
rL LLVM

Event Timeline


This is a draft of the third patch to apply the BLIS matmul optimization pattern on matmul kernels. It should probably be split to separate the modification of the existing code (creation of the macro-kernel and determination of its parameters, http://reviews.llvm.org/D21491) from the replacement of access relations.

gareevroman updated this revision to Diff 63393. Jul 9 2016, 8:50 AM

Fix incomplete comments.

Fix wrong comments and names of variables.

grosser edited edge metadata. Jul 11 2016, 6:58 AM

Hi Roman,

thanks for your work. Even though this is still WIP, here are already some first comments:

  1. A test case is needed and will make reviewing this patch easier.
  2. I am not sure I like the interface to replace memory accesses. It seems this interface takes a specific memory access relation and replaces all accesses that have this specific relation with the new access relation. I understand that this might work, but wonder if another interface might not be more intuitive. One could, e.g., provide a relation

A[i][j] -> New[x][y][z], which is applied to all accesses to remap all accesses to A to a new array. Would such an interface work for your transformation?

Best,
Tobias

Remaining comments are inline.

Best,
Tobias

include/polly/ScheduleOptimizer.h
247 ↗(On Diff #63480)

Please either keep explicit parameters or use a struct with named fields to pass parameters, as otherwise their function is not clear.

lib/Transform/ScheduleOptimizer.cpp
543 ↗(On Diff #63480)

Nice. I like small functions!

551 ↗(On Diff #63480)

Drop " It should be noted that". It does not carry information.

559 ↗(On Diff #63480)

I would use Stmt instead of ScpStmt.

614 ↗(On Diff #63393)

It seems sufficient to reference this paper once, rather than at so many places.

Besides referencing the paper, it would be good to describe what these functions do. E.g what parameters they compute and what reasoning is behind this parameter computation. In the end, the comments and code should be self contained, such that people do not need to read the paper to get a first understanding of the code.

MicroKernelParams: I am not a big fan of using just a list of int as parameter vector. I suggest either giving the list and _naming_ them explicitly, or defining a struct with named fields where each field is described.
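A struct with named fields, as suggested, could look like the following sketch. The type names, field names, and placeholder values here are hypothetical illustrations of the suggestion, not the patch's actual interface:

```cpp
#include <cassert>

// Hypothetical structs with named fields instead of an int vector; the
// field meanings follow the BLIS analytical model (Mr x Nr register tile;
// Mc, Nc, Kc cache-level block sizes).
struct MicroKernelParamsTy {
  int Mr; // rows of the register tile updated by the micro-kernel
  int Nr; // columns of the register tile
};

struct MacroKernelParamsTy {
  int Mc; // block size along the m dimension
  int Nc; // block size along the n dimension
  int Kc; // block size along the k dimension
};

// Returning the structure by value keeps the interface self-documenting.
MicroKernelParamsTy getMicroKernelParams() {
  return {4, 8}; // placeholder values for illustration only
}
```

Callers then refer to `Params.Mr` instead of an anonymous `Params[0]`, which is the readability gain the comment asks for.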

676 ↗(On Diff #63393)

I would use 'Stmt' instead of ScpStmt.

700 ↗(On Diff #63393)

Using strings to get a first prototype is fine, but in the code that we commit I would prefer to build up these maps not by parsing strings, but by constructing these functions from isl. A common approach is to build an isl_aff first and then translate it to an isl_map.

711 ↗(On Diff #63393)

Pulling getMicroKernelParams and getMacroKernelParams into their own functions is a good idea. It might be better to return the resulting structure as a return value, though.

Hi Roman,

thanks for your work. Even though this is still WIP, here are already some first comments:

Hi Tobias,

thanks for the comments!

  1. A test case is needed and will make reviewing this patch easier.

Maybe it should check that, in the case of matrix multiplication, Polly generates code similar to the following:

%arg3.s2a.reload228 = load double, double* %arg3.s2a
%polly.access.cast.arg6229 = bitcast [1024 x double]* %arg6 to double*
%181 = mul nsw i64 1024, %polly.indvar95
%182 = mul nsw i64 4, %polly.indvar107
%183 = add nsw i64 %181, %182
%184 = add nsw i64 %183, 1
%polly.access.arg6230 = getelementptr double, double* %polly.access.cast.arg6229, i64 %184
%tmp17_p_scalar_231 = load double, double* %polly.access.arg6230, align 8, !alias.scope !4, !noalias !7
%p_tmp18232 = fmul double %tmp17_p_scalar_231, %arg3.s2a.reload228
%polly.access.cast.arg7233 = bitcast [1056 x double]* %arg7 to double*
%185 = mul nsw i64 2048, %polly.indvar95
%186 = mul nsw i64 8, %polly.indvar107
%187 = add nsw i64 %185, %186
%188 = add nsw i64 %187, 1

What do you think about it?

  2. I am not sure I like the interface to replace memory accesses. It seems this interface takes a specific memory access relation and replaces all accesses that have this specific relation with the new access relation. I understand that this might work, but wonder if another interface might not be more intuitive. One could, e.g., provide a relation

A[i][j] -> New[x][y][z], which is applied to all accesses to remap all accesses to A to a new array. Would such an interface work for your transformation?

Yes, I think that an ability to remap specific memory accesses to A is unnecessary for the transformation, and such an interface would work fine for it.

  2. I am not sure I like the interface to replace memory accesses. It seems this interface takes a specific memory access relation and replaces all accesses that have this specific relation with the new access relation. I understand that this might work, but wonder if another interface might not be more intuitive. One could, e.g., provide a relation

A[i][j] -> New[x][y][z], which is applied to all accesses to remap all accesses to A to a new array. Would such an interface work for your transformation?

Yes, I think that an ability to remap specific memory accesses to A is unnecessary for the transformation, and such an interface would work fine for it.

Sorry for the misprint. I think that it is possible to remap specific memory accesses to A using such an interface.

Meinersbur added inline comments. Jul 14 2016, 7:05 AM
lib/Transform/ScheduleOptimizer.cpp
547 ↗(On Diff #63480)

If with the ScpStmt you refer to the parameter with that name, use @p ScpStmt.

Usually we use Stmt as variable name for a generic ScopStmt.

548–549 ↗(On Diff #63480)

a form of NewMemAccessRel

What form?

562 ↗(On Diff #63480)

I find the style "The map, which ..." odd.

Suggestion:

A map describing/that describes/containing the new access relations.
567 ↗(On Diff #63480)

Use C++11 range loops:
for (auto *MemA : *ScpStmt)
this will make (*MemA)-> unnecessary.

568 ↗(On Diff #63480)

auto *AccessRelation = to limit the variable's scope to inside the loop body; it's not used outside of it.

673 ↗(On Diff #63480)

utilized -> implemented. Is it possible to have a summary of the algorithm here?

681–682 ↗(On Diff #63480)

Suggestion:

/// @param MicroKernelParams,MacroKernelParams Parameters of the BLIS micro kernel.

Hi Roman,

thanks for your work. Even though this is still WIP, here are already some first comments:

Hi Tobias,

thanks for the comments!

  1. A test case is needed and will make reviewing this patch easier.

Maybe it should check that, in the case of matrix multiplication, Polly generates code similar to the following:

%arg3.s2a.reload228 = load double, double* %arg3.s2a
%polly.access.cast.arg6229 = bitcast [1024 x double]* %arg6 to double*
%181 = mul nsw i64 1024, %polly.indvar95
%182 = mul nsw i64 4, %polly.indvar107
%183 = add nsw i64 %181, %182
%184 = add nsw i64 %183, 1
%polly.access.arg6230 = getelementptr double, double* %polly.access.cast.arg6229, i64 %184
%tmp17_p_scalar_231 = load double, double* %polly.access.arg6230, align 8, !alias.scope !4, !noalias !7
%p_tmp18232 = fmul double %tmp17_p_scalar_231, %arg3.s2a.reload228
%polly.access.cast.arg7233 = bitcast [1056 x double]* %arg7 to double*
%185 = mul nsw i64 2048, %polly.indvar95
%186 = mul nsw i64 8, %polly.indvar107
%187 = add nsw i64 %185, %186
%188 = add nsw i64 %187, 1

What do you think about it?

Either this, or you dump the changed access functions in ScopInfo.

Best,
Tobias

  2. I am not sure I like the interface to replace memory accesses. It seems this interface takes a specific memory access relation and replaces all accesses that have this specific relation with the new access relation. I understand that this might work, but wonder if another interface might not be more intuitive. One could, e.g., provide a relation

A[i][j] -> New[x][y][z], which is applied to all accesses to remap all accesses to A to a new array. Would such an interface work for your transformation?

Hi Tobias,

I'm afraid that the interface probably doesn't work for the transformation, because, as far as I understand, we need additional information about the access relation of A to distinguish, for example, A[i][j] and A[i][k]. It seems that there is no way to pass such information using such an interface. Sorry, I should have recognized this earlier.

Maybe we could add an overloaded version that accepts an access relation of A, or make this argument optional.

What do you think about it?

Hi Michael,

thanks for the comments! I'll try to take them into account.

Hi Roman,

I may not yet fully understand this issue, but don't you transform all accesses to the matrix? Assuming you don't transpose a submatrix but the full matrix A, applying the mapping A[x0][x1] -> A2[x1][x0] to all accesses would be what you want to do. The same should hold when you are working with submatrices, except that the mapping function is slightly more complicated.

Hi Roman,

I may not yet fully understand this issue, but don't you transform all accesses to the matrix? Assuming you don't transpose a submatrix but the full matrix A, applying the mapping A[x0][x1] -> A2[x1][x0] to all accesses would be what you want to do. The same should hold when you are working with submatrices, except that the mapping function is slightly more complicated.

Yes, I think this is true.

I guess I figured it out. Thanks.

It is not possible that, for example, the A[i][j] and A[i][k] of the access relations S1[i, j, k] -> A[i][j] and S2[i, j, k] -> A[i][k], which may correspond to the following code, have the same identifier. Consequently, we'll be able to remap only one of these two accesses if we apply A[i][j] -> New[x][y][z] using the interface. Is that right?

for i = LB1, …, UB1 in steps of 1
  for j = LB2, …, UB2 in steps of 1
    …  = A[i][j] : S1
  endfor
  for k = LB3, …, UB3 in steps of 1
    …  = A[i][k] : S2
  endfor
endfor

I've planned to implement it by iterating over all memory accesses and remapping only those ranges of access relations that live in the same space as the domain of the provided relation. The interface probably works fine if we optimize matrix multiplication and should remap only, for example, S[i, j, k] -> A[i, k] and S[i, j, k] -> B[k, j]. In our case, however, there is no limit on the number of access relations that should be remapped. We'll have to iterate over all memory accesses for every access relation that we want to transform. Maybe I don't understand something.

Hi Roman,

I may not yet fully understand this issue, but don't you transform all accesses to the matrix? Assuming you don't transpose a submatrix but the full matrix A, applying the mapping A[x0][x1] -> A2[x1][x0] to all accesses would be what you want to do. The same should hold when you are working with submatrices, except that the mapping function is slightly more complicated.

Yes, I think this is true.

I guess I figured it out. Thanks.

It is not possible that, for example, the A[i][j] and A[i][k] of the access relations S1[i, j, k] -> A[i][j] and S2[i, j, k] -> A[i][k], which may correspond to the following code, have the same identifier. Consequently, we'll be able to remap only one of these two accesses if we apply A[i][j] -> New[x][y][z] using the interface. Is that right?

Not sure what you mean by the same identifier. It still seems you want to apply the mapping only to a subset of the memory references by somehow identifying the ones on which you want to apply the mapping through the identifiers in their access function. Is this true?

for i = LB1, …, UB1 in steps of 1
  for j = LB2, …, UB2 in steps of 1
    …  = A[i][j] : S1
  endfor
  for k = LB3, …, UB3 in steps of 1
    …  = A[i][k] : S2
  endfor
endfor

I've planned to implement it by iterating over all memory accesses and remapping only those ranges of access relations that live in the same space as the domain of the provided relation. The interface probably works fine if we optimize matrix multiplication and should remap only, for example, S[i, j, k] -> A[i, k] and S[i, j, k] -> B[k, j]. In our case, however, there is no limit on the number of access relations that should be remapped. We'll have to iterate over all memory accesses for every access relation that we want to transform. Maybe I don't understand something.

Not sure I understand this. Could you maybe post the ScopInfo before your mapping and the ScopInfo that you expect as a result? In fact, maybe you can even implement this transformation with the jscop interface (it might need some enhancements to support the introduction of new array accesses).

Best,
Tobi

Hi Roman,

I may not yet fully understand this issue, but don't you transform all accesses to the matrix? Assuming you don't transpose a submatrix but the full matrix A, applying the mapping A[x0][x1] -> A2[x1][x0] to all accesses would be what you want to do. The same should hold when you are working with submatrices, except that the mapping function is slightly more complicated.

Yes, I think this is true.

I guess I figured it out. Thanks.

It is not possible that, for example, the A[i][j] and A[i][k] of the access relations S1[i, j, k] -> A[i][j] and S2[i, j, k] -> A[i][k], which may correspond to the following code, have the same identifier. Consequently, we'll be able to remap only one of these two accesses if we apply A[i][j] -> New[x][y][z] using the interface. Is that right?

Not sure what you mean by the same identifier. It still seems you want to apply the mapping only to a subset of the memory references by somehow identifying the ones on which you want to apply the mapping through the identifiers in their access function. Is this true?

I didn't want to apply the mapping only to a subset of the memory references. I just wanted to understand what we get if we apply the transformation to the example.

OK. Now I've realized that the example is incorrect, because it produces two SCoP statements and we apply the transformation to memory references of only one SCoP statement. Thanks. I think that I don't have any more questions related to it.

for i = LB1, …, UB1 in steps of 1
  for j = LB2, …, UB2 in steps of 1
    …  = A[i][j] : S1
  endfor
  for k = LB3, …, UB3 in steps of 1
    …  = A[i][k] : S2
  endfor
endfor

I've planned to implement it by iterating over all memory accesses and remapping only those ranges of access relations that live in the same space as the domain of the provided relation. The interface probably works fine if we optimize matrix multiplication and should remap only, for example, S[i, j, k] -> A[i, k] and S[i, j, k] -> B[k, j]. In our case, however, there is no limit on the number of access relations that should be remapped. We'll have to iterate over all memory accesses for every access relation that we want to transform. Maybe I don't understand something.

Not sure I understand this. Could you maybe post the ScopInfo before your mapping and the ScopInfo that you expect as a result? In fact, maybe you can even implement this transformation with the jscop interface (it might need some enhancements to support the introduction of new array accesses).

This is the ScopInfo before mapping. It corresponds to a general matrix multiplication. I think that we aren't interested in some specific remapping, so the expected result is probably not important.

Function: kernel_gemm
Region: %10---%35
Max Loop Depth:  3
Invariant Accesses: {
}
Context:
{  :  }
Assumed Context:
{  :  }
Invalid Context:
{  : 1 = 0 }
Arrays {
    double MemRef_5[*][1056]; // Element size 8
    double MemRef_4; // Element size 8
    double MemRef_6[*][1024]; // Element size 8
    double MemRef_3; // Element size 8
    double MemRef_7[*][1056]; // Element size 8
}
Arrays (Bounds as pw_affs) {
    double MemRef_5[*][ { [] -> [(1056)] } ]; // Element size 8
    double MemRef_4; // Element size 8
    double MemRef_6[*][ { [] -> [(1024)] } ]; // Element size 8
    double MemRef_3; // Element size 8
    double MemRef_7[*][ { [] -> [(1056)] } ]; // Element size 8
}
Alias Groups (2):
    [[ <{ MemRef_7[(0), (0)] }, { MemRef_7[(1023), (1056)] }> <{ MemRef_5[(0), (0)] }, { MemRef_5[(1055), (1056)] }> ]]
    [[ <{ MemRef_6[(0), (0)] }, { MemRef_6[(1055), (1024)] }> <{ MemRef_5[(0), (0)] }, { MemRef_5[(1055), (1056)] }> ]]
Statements {
Stmt_12
        Domain :=
            { Stmt_12[i0, i1] : 0 <= i0 <= 1055 and 0 <= i1 <= 1055 };
        Schedule :=
            { Stmt_12[i0, i1] -> [i0, i1, 0, 0] };
        ReadAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_12[i0, i1] -> MemRef_5[i0, i1] };
        ReadAccess := [Reduction Type: NONE] [Scalar: 1]
            { Stmt_12[i0, i1] -> MemRef_4[] };
        MustWriteAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_12[i0, i1] -> MemRef_5[i0, i1] };
Stmt_17
        Domain :=
            { Stmt_17[i0, i1, i2] : 0 <= i0 <= 1055 and 0 <= i1 <= 1055 and 0 <= i2 <= 1023 };
        Schedule :=
            { Stmt_17[i0, i1, i2] -> [i0, i1, 1, i2] };
        ReadAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_17[i0, i1, i2] -> MemRef_6[i0, i2] };
        ReadAccess := [Reduction Type: NONE] [Scalar: 1]
            { Stmt_17[i0, i1, i2] -> MemRef_3[] };
        ReadAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_17[i0, i1, i2] -> MemRef_7[i2, i1] };
        ReadAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_17[i0, i1, i2] -> MemRef_5[i0, i1] };
        MustWriteAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_17[i0, i1, i2] -> MemRef_5[i0, i1] };
}

If I'm not mistaken, we want to remap:

ReadAccess := [Reduction Type: NONE] [Scalar: 0]
  { Stmt_17[i0, i1, i2] -> MemRef_6[i0, i2] };

ReadAccess := [Reduction Type: NONE] [Scalar: 0]
  { Stmt_17[i0, i1, i2] -> MemRef_7[i2, i1] };

However, if we modify gemm,

for (i = 0; i < _PB_NI; i++)
  for (j = 0; j < _PB_NJ; j++)
    {
      C[i][j] *= beta;
      for (k = 0; k < _PB_NK; ++k)
        C[i][j] += alpha * A[i][k] * B[k][j] * B1[k][j];
    }

we'll remap the accesses to A[i][k], B[k][j] and also B1[k][j] (Stmt_17[i0, i1, i2] -> MemRef_6[i0, i2], Stmt_17[i0, i1, i2] -> MemRef_7[i2, i1], Stmt_18[i0, i1, i2] -> MemRef_8[i2, i1]), because such a statement also passes the checks. Consequently, there is no limit on the number of access relations that should be remapped.

As far as I understand, to remap accesses to a specific array A we should iterate over all memory accesses of a statement and remap only those ranges of access relations that live in the same space as the domain of the provided relation. However, since there is no limit on the number of arrays and their accesses that should be remapped, it can be costly to remap them all.

[...]

Function: kernel_gemm
Region: %10---%35
Max Loop Depth:  3
Invariant Accesses: {
}
Context:
{  :  }
Assumed Context:
{  :  }
Invalid Context:
{  : 1 = 0 }
Arrays {
    double MemRef_5[*][1056]; // Element size 8
    double MemRef_4; // Element size 8
    double MemRef_6[*][1024]; // Element size 8
    double MemRef_3; // Element size 8
    double MemRef_7[*][1056]; // Element size 8
}
Arrays (Bounds as pw_affs) {
    double MemRef_5[*][ { [] -> [(1056)] } ]; // Element size 8
    double MemRef_4; // Element size 8
    double MemRef_6[*][ { [] -> [(1024)] } ]; // Element size 8
    double MemRef_3; // Element size 8
    double MemRef_7[*][ { [] -> [(1056)] } ]; // Element size 8
}
Alias Groups (2):
    [[ <{ MemRef_7[(0), (0)] }, { MemRef_7[(1023), (1056)] }> <{ MemRef_5[(0), (0)] }, { MemRef_5[(1055), (1056)] }> ]]
    [[ <{ MemRef_6[(0), (0)] }, { MemRef_6[(1055), (1024)] }> <{ MemRef_5[(0), (0)] }, { MemRef_5[(1055), (1056)] }> ]]
Statements {
Stmt_12
        Domain :=
            { Stmt_12[i0, i1] : 0 <= i0 <= 1055 and 0 <= i1 <= 1055 };
        Schedule :=
            { Stmt_12[i0, i1] -> [i0, i1, 0, 0] };
        ReadAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_12[i0, i1] -> MemRef_5[i0, i1] };
        ReadAccess := [Reduction Type: NONE] [Scalar: 1]
            { Stmt_12[i0, i1] -> MemRef_4[] };
        MustWriteAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_12[i0, i1] -> MemRef_5[i0, i1] };
Stmt_17
        Domain :=
            { Stmt_17[i0, i1, i2] : 0 <= i0 <= 1055 and 0 <= i1 <= 1055 and 0 <= i2 <= 1023 };
        Schedule :=
            { Stmt_17[i0, i1, i2] -> [i0, i1, 1, i2] };
        ReadAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_17[i0, i1, i2] -> MemRef_6[i0, i2] };
        ReadAccess := [Reduction Type: NONE] [Scalar: 1]
            { Stmt_17[i0, i1, i2] -> MemRef_3[] };
        ReadAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_17[i0, i1, i2] -> MemRef_7[i2, i1] };
        ReadAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_17[i0, i1, i2] -> MemRef_5[i0, i1] };
        MustWriteAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_17[i0, i1, i2] -> MemRef_5[i0, i1] };
}

If I'm not mistaken, we want to remap:

ReadAccess := [Reduction Type: NONE] [Scalar: 0]
  { Stmt_17[i0, i1, i2] -> MemRef_6[i0, i2] };

ReadAccess := [Reduction Type: NONE] [Scalar: 0]
  { Stmt_17[i0, i1, i2] -> MemRef_7[i2, i1] };

However, if we modify gemm,

for (i = 0; i < _PB_NI; i++)
  for (j = 0; j < _PB_NJ; j++)
    {
      C[i][j] *= beta;
      for (k = 0; k < _PB_NK; ++k)
        C[i][j] += alpha * A[i][k] * B[k][j] * B1[k][j];
    }

we'll remap the accesses to A[i][k], B[k][j] and also B1[k][j] (Stmt_17[i0, i1, i2] -> MemRef_6[i0, i2], Stmt_17[i0, i1, i2] -> MemRef_7[i2, i1], Stmt_18[i0, i1, i2] -> MemRef_8[i2, i1]), because such a statement also passes the checks. Consequently, there is no limit on the number of access relations that should be remapped.

I am confused. What is the matrix B1 about? I would expect the inner statement to just be:

C[i][j] += alpha * A[i][k] * B[k][j];

Also, AFAIU only one of the two matrices A and B needs to be transposed. Consequently, I would expect that only one of them has to be mapped. You seem to want to map two arrays - MemRef_6 and MemRef_7. Why is this necessary?

Also, it seems we still don't understand each other well. The text above seems to be mostly discussing that it is hard to find the right array references to transform, but I do not understand yet if you mean it is -- even for you as a human -- difficult to understand how and what you want to map _or_ if you mean applying the mapping the way you want is difficult. That's why I asked if you can send me the transformed ScopInfo as you would like to have it. Even though it might be trivial, it shows me _what_ you want to achieve. Having this information I can then more easily explain _how_ you can achieve this.

As far as I understand, to remap accesses to a specific array A we should iterate over all memory accesses of a statement and remap only those ranges of access relations that live in the same space as the domain of the provided relation.

That is what I think, yes.

However, since there is no limit on the number of arrays and their accesses that should be remapped, it can be costly to remap them all.

I don't think iterating once over all memory accesses is costly in compile time. I would worry about this only if we have profiling results that suggest this is indeed an issue. For me the code here should be pretty simple:

void remapAccesses(ScopArrayInfo *ArrayToTransform,
                   __isl_take isl_map *Transformation) {
  for (auto &Stmt : S)
    for (auto &Acc : Stmt.accesses())
      if (Acc.getScopArrayInfo() == ArrayToTransform) {
        isl_map *AccessRelation = Acc.getAccessRelation();
        AccessRelation =
            isl_map_apply_range(AccessRelation, isl_map_copy(Transformation));
        Acc.setAccessRelation(AccessRelation);
      }
  isl_map_free(Transformation);
}

Best,
Tobias
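The iterate-and-compose idea in the sketch above can also be modeled in plain C++, with access relations represented as functions from statement iterators to array subscripts. All names and types here are illustrative stand-ins, not Polly or isl API:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Toy model: an access relation maps statement iterators to array
// subscripts. These types stand in for isl maps and ScopArrayInfo.
using Subscripts = std::vector<long>;
using Relation = std::function<Subscripts(const Subscripts &)>;

struct AccessModel {
  std::string ArrayId; // stands in for the ScopArrayInfo of the access
  Relation Rel;        // stands in for the access relation
};

// Analogue of remapAccesses: compose the transformation with the range of
// every access relation that refers to the array being transformed.
void remapAccesses(std::vector<AccessModel> &Accesses,
                   const std::string &ArrayToTransform,
                   const std::string &NewArrayId,
                   const Relation &Transformation) {
  for (auto &Acc : Accesses)
    if (Acc.ArrayId == ArrayToTransform) {
      Relation Old = Acc.Rel;
      Acc.Rel = [Old, Transformation](const Subscripts &Iters) {
        return Transformation(Old(Iters)); // apply to the old range
      };
      Acc.ArrayId = NewArrayId;
    }
}
```

For example, with one access S[i, j, k] -> A[i, k] and the transformation A[x0][x1] -> A2[x1][x0], the remapped access becomes S[i, j, k] -> A2[k, i], while accesses to other arrays are untouched.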

Also, AFAIU only one of the two matrices A and B needs to be transposed. Consequently, I would expect that only one of them has to be mapped. You seem to want to map two arrays - MemRef_6 and MemRef_7. Why is this necessary?

My understanding is that it is necessary, since we want to perform the packing transformation, which is needed, for example, to determine optimal values of the Nc, Mc and Kc parameters from [1].

Let's consider the algorithm described in [1]. It operates on matrices stored in column-major order. (In our case, matrices are stored in row-major order; I'll describe the differences that can occur.)

We consider the matrix-matrix multiplication that has the following form C := AB + C, where the sizes of A, B, C are m x k, k x n, and m x n, respectively.

According to [1], we want to optimize it and get the following code ([1], p. 6)

Loop 1 for jc = 0, ..., n − 1 in steps of Nc
Loop 2   for pc = 0, ...,  k − 1 in steps of Kc
           B(pc : pc + Kc − 1, jc : jc + Nc − 1) → Bc // Pack into Bc
Loop 3     for ic = 0, ..., m − 1 in steps of Mc
             A(ic : ic + Mc − 1, pc : pc + Kc − 1) → Ac // Pack into Ac
Loop 4       for jr = 0, ..., Nc − 1 in steps of Nr
Loop 5         for ir = 0, ..., Mc − 1 in steps of Mr
Loop 6           for pr = 0, ..., Kc − 1 in steps of 1
                   Cc(ir : ir + Mr - 1, jr : jr + Nr - 1) += Ac(ir : ir + Mr - 1, pr) * Bc(pr, jr : jr + Nr - 1)
               endfor
             endfor
           endfor
         endfor
       endfor
     endfor

In case of row-major order:

Loop 1 for ic = 0, ..., m−1 in steps of mc
Loop 2   for pc = 0, ..., k−1 in steps of kc
           A(ic : ic + mc − 1, pc : pc + kc − 1) → Ac // Pack into Ac
Loop 3     for jc = 0, ..., n−1 in steps of nc
             B(pc : pc + kc − 1, jc : jc + nc − 1) → Bc // Pack into Bc
Loop 4       for ir = 0, ..., mc −1 in steps of mr
Loop 5         for jr = 0, ..., nc −1 in steps of nr
Loop 6           for pr = 0, ..., kc −1 in steps of 1
                   Cc(ir : ir + Mr - 1, jr : jr + Nr - 1) += Ac(ir : ir + Mr - 1, pr) * Bc(pr, jr : jr + Nr - 1)
                 endfor
               endfor
             endfor
           endfor
         endfor
       endfor

'B(pc : pc + Kc − 1, jc : jc + Nc − 1) → Bc' and 'A(ic : ic + Mc − 1, pc : pc + Kc − 1) → Ac' are the packings of matrices B and A, respectively. They reorganize the entries of B and A into the blocks Bc and Ac. The packing layout is illustrated in Figure 3 ([1], p. 6).

If I'm not mistaken, it is a data-layout transformation that requires introducing a new one-dimensional array, copying data to the array, and changing the memory access locations of the compute kernel to reference the array.

Let's assume that there is a loop 7, which iterates over a micro-panel of width nr, and a loop 8, which iterates over a micro-tile of length mr (they are unrolled in our implementations to get the BLIS micro-kernel). Consequently, the access functions of Ac and Bc will be [o1, o2, o3, o4, o5, o6, o7, o8] -> [Mr * (o6 + Kc * o5) + o8] and [o1, o2, o3, o4, o5, o6, o7, o8] -> [Nr * (o6 + Kc * o4) + o7], respectively.

In the case of row-major order they are [o0, o1, o2, o3, o4, o5, o6, o7, o8] -> [Mr * (o5 + Kc * o3) + o6] and [o0, o1, o2, o3, o4, o5, o6, o7, o8] -> [Nr * (o5 + Kc * o4) + o7].
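The row-major Ac access function above can be checked with a small packing sketch. The sizes Mr, Mc, Kc and the function name are illustrative only; it = micro-panel index, pr = column inside the panel, ir = row inside the micro-tile:

```cpp
#include <cassert>
#include <vector>

// Illustrative sizes: an Mc x Kc block of A is packed into Mc/Mr
// micro-panels of height Mr; inside a panel, each of the Kc columns is
// stored as Mr contiguous elements.
constexpr int Mr = 2, Mc = 4, Kc = 3;

// Pack the block A(0:Mc-1, 0:Kc-1) into a 1-D buffer Ac so that an
// element is found at index Mr * (pr + Kc * it) + ir, matching the
// row-major Ac access function discussed above.
std::vector<double> packAc(const std::vector<std::vector<double>> &A) {
  std::vector<double> Ac(Mc * Kc);
  for (int it = 0; it < Mc / Mr; ++it)   // micro-panel
    for (int pr = 0; pr < Kc; ++pr)      // column inside the panel
      for (int ir = 0; ir < Mr; ++ir)    // row inside the micro-tile
        Ac[Mr * (pr + Kc * it) + ir] = A[it * Mr + ir][pr];
  return Ac;
}
```

Within a micro-panel, the Mr elements of one column are contiguous, which is exactly what lets the micro-kernel read Ac with stride 1.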

The importance of packing is described in [1] (p. 6):

  1. We access consecutive memory locations, because the elements of Ac and Bc will be read sequentially. Accessing consecutive memory locations is usually faster.
  2. Data from Ac and Bc can be aligned. Accessing aligned data is also typically faster.
  3. It helps to preload data from Ac and Bc into certain cache levels to reduce the time to access their elements (to do this, in particular, we determine the parameters Kc, Mc and Nc ([1], p. 10)).

If I'm not mistaken, packing of A and B requires mapping the accesses to both the A and B arrays (in the previous case they are MemRef_6 and MemRef_7). Is that correct?

I am confused. What is the matrix B1 about? I would expect the inner statement to just be:

C[i][j] += alpha * A[i][k] * B[k][j];

C[i][j] += alpha * A[i][k] * B[k][j] * B1[k][j] also passes the checks. That's why B1[k][j] is an example of a matrix that can also appear in the inner statement.

In http://reviews.llvm.org/D20575 we decided to determine the following class of statements (this is a copy of its description):

" 1. there are exactly three input dimensions
2. all memory accesses of the statement will have stride 0 or 1, if we interchange loops (switch the variable used in the inner loop to the outer loop)
3. all memory accesses of the statement, except the last one, are read memory accesses, and the last one is a write memory access
4. all subscripts of the last memory access of the statement don't contain the variable used in the inner loop

An algorithm from [1] was designed to optimize a subclass of this class and could possibly cause a performance regression if we tried to apply it to all statements of the class. However, I think that determining the class requires simpler pattern matching than determining statements that contain only matrix multiplication. Furthermore, I think that we can use an algorithm similar to the one presented in [1] to optimize statements from the class in the future."

This class also contains 'C[i][j] += alpha * A[i][k] * B[k][j] * B1[k][j]'.

I've planned to remap all accesses of the form Matr1[i][k] and Matr2[k][j], where i, j, k are the induction variables of, for example, the following loop nest:

for i = LB1, …, UB1 in steps of 1
  for j = LB2, …, UB2 in steps of 1
    for k = LB3, …, UB3 in steps of 1
      …
    endfor
  endfor
endfor

I think that it shouldn't affect the optimization of general matrix-matrix multiplication.

Also, it seems we still don't understand each other well. The text above seems to be mostly discussing that it is hard to find the right array references to transform, but I do not understand yet if you mean it is -- even for you as a human -- difficult to understand how and what you want to map _or_ if you mean applying the mapping the way you want is difficult. That's why I asked if you can send me the transformed ScopInfo as you would like to have it. Even though it might be trivial, it shows me _what_ you want to achieve. Having this information I can then more easily explain _how_ you can achieve this.

I assumed that we were planning to remap all accesses of the form Matr1[i][k] and Matr2[k][j]. Sorry, I should have clarified this earlier.

Consequently, it was difficult for me to understand how to effectively remap accesses using the interface. It was also difficult to understand why we don't transpose submatrices.

Should we determine statements that contain only the two accesses A[i][k] and B[k][j]?

I think that we could remap all accesses of the form Matr1[i][k] and Matr2[k][j]. To reduce the cost of remapping, we could limit the number of such accesses that a statement from the class may have. You wrote that it's possible to work with submatrices. If the mapping function isn't too complicated, we could probably try to do it.

This is the transformed ScopInfo of general matrix-matrix multiplication, which I would like to have:

Function: kernel_gemm
Region: %10---%35
Max Loop Depth:  3
Invariant Accesses: {
}
Context:
{  :  }
Assumed Context:
{  :  }
Invalid Context:
{  : 1 = 0 }
Arrays {
    double MemRef_5[*][1056]; // Element size 8
    double MemRef_4; // Element size 8
    double MemRef_6[*][1024]; // Element size 8
    double MemRef_3; // Element size 8
    double MemRef_7[*][1056]; // Element size 8
}
Arrays (Bounds as pw_affs) {
    double MemRef_5[*][ { [] -> [(1056)] } ]; // Element size 8
    double MemRef_4; // Element size 8
    double MemRef_6[*][ { [] -> [(1024)] } ]; // Element size 8
    double MemRef_3; // Element size 8
    double MemRef_7[*][ { [] -> [(1056)] } ]; // Element size 8
}
Alias Groups (2):
    [[ <{ MemRef_7[(0), (0)] }, { MemRef_7[(1023), (1056)] }> <{ MemRef_5[(0), (0)] }, { MemRef_5[(1055), (1056)] }> ]]
    [[ <{ MemRef_6[(0), (0)] }, { MemRef_6[(1055), (1024)] }> <{ MemRef_5[(0), (0)] }, { MemRef_5[(1055), (1056)] }> ]]
Statements {
Stmt_12
        Domain :=
            { Stmt_12[i0, i1] : 0 <= i0 <= 1055 and 0 <= i1 <= 1055 };
        Schedule :=
            { Stmt_12[i0, i1] -> [0, o1, o2, i0 - 32o1, i1 - 32o2, 0, 0, 0, 0, 0, 0, 0, 0] : -31 + i0 <= 32o1 <= i0 and -31 + i1 <= 32o2 <= i1 };
        ReadAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_12[i0, i1] -> MemRef_5[i0, i1] };
        ReadAccess := [Reduction Type: NONE] [Scalar: 1]
            { Stmt_12[i0, i1] -> MemRef_4[] };
        MustWriteAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_12[i0, i1] -> MemRef_5[i0, i1] };
Stmt_17
        Domain :=
            { Stmt_17[i0, i1, i2] : 0 <= i0 <= 1055 and 0 <= i1 <= 1055 and 0 <= i2 <= 1023 };
        Schedule :=
            { Stmt_17[i0, i1, i2] -> [1, o1, o2, o3, o4, o5, i2 - 256o2, 0, 0, 0, i0 - 16o1 - 4o4, i1 - 96o3 - 8o5, 0] : -15 + i0 <= 16o1 <= i0 and -255 + i2 <= 256o2 <= i2 and -95 + i1 <= 96o3 <= i1 and -3 + i0 - 16o1 <= 4o4 <= i0 - 16o1 and -7 + i1 - 96o3 <= 8o5 <= i1 - 96o3 };
        ReadAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_17[i0, i1, i2] -> MemRef_6[i0, i2] };
       new: { Stmt_17[i0, i1, i2] -> MemRef_6[o0] : 4*floor((-i0 + o0)/4) = -i0 + o0 and -3 - 4i2 + o0 + 4096*floor((i0)/16) - 1024*floor((i0)/4) <= 1024*floor((255 - i2)/256) <= -4i2 + o0 + 4096*floor((i0)/16) - 1024*floor((i0)/4) };
        ReadAccess := [Reduction Type: NONE] [Scalar: 1]
            { Stmt_17[i0, i1, i2] -> MemRef_3[] };
        ReadAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_17[i0, i1, i2] -> MemRef_7[i2, i1] };
       new: { Stmt_17[i0, i1, i2] -> MemRef_7[o0] : 8*floor((-i1 + o0)/8) = -i1 + o0 and -7 - 8i2 + o0 + 8192*floor((i0)/16) - 2048*floor((i0)/4) <= 2048*floor((255 - i2)/256) <= -8i2 + o0 + 8192*floor((i0)/16) - 2048*floor((i0)/4) };
        ReadAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_17[i0, i1, i2] -> MemRef_5[i0, i1] };
        MustWriteAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_17[i0, i1, i2] -> MemRef_5[i0, i1] };
}

As far as I understand, to remap accesses to a specific array A we should iterate over all memory accesses of a statement and remap only those ranges of access relations that live in the same space as the domain of the provided relation.

That is what I think, yes.

However, since there is no limit on the number of arrays and their accesses that should be remapped, it can be costly to remap them all.

I don't think iterating once over all memory accesses is costly in compile time. I would worry about this only if we have profiling results that suggest this is indeed an issue. For me the code here should be pretty simple:

void remapAccesses(ScopArrayInfo *ArrayToTransform,
                   __isl_take isl_map *Transformation) {
  for (auto &Stmt : S)
    for (auto &Acc : Stmt.accesses())
      if (Acc.getScopArrayInfo() == ArrayToTransform) {
        isl_map *AccessRelation = Acc.getAccessRelation();
        AccessRelation =
            isl_map_apply_range(AccessRelation, isl_map_copy(Transformation));
        Acc.setAccessRelation(AccessRelation);
      }
  isl_map_free(Transformation);
}

Yes, I think that in case of the general matrix multiplication it should be fine.

Ref.: [1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf

gareevroman edited edge metadata.

A new version of the patch updated according to the comments.

As far as I understand, it is more intuitive to remap all accesses to an array onto a new array using, for example, the following function:

void remapAccesses(ScopArrayInfo *ArrayToTransform,
                   __isl_take isl_map *Transformation) {
  for (auto &Stmt : S)
    for (auto &Acc : Stmt.accesses())
      if (Acc.getScopArrayInfo() == ArrayToTransform) {
        isl_map *AccessRelation = Acc.getAccessRelation();
        AccessRelation =
            isl_map_apply_range(AccessRelation, isl_map_copy(Transformation));
        Acc.setAccessRelation(AccessRelation);
      }
  isl_map_free(Transformation);
}

However, if I'm not mistaken, we should first distinguish the accesses that will be remapped. For example, to perform the packing transformation we should be able to distinguish the following access relations, because packing them requires different transformations:

S[i, j, k]->M1[i, k], S[i, j, k]->M2[k, j]

Consequently, we should iterate over all memory accesses of a statement to find the required relations if we use remapAccesses:

for (auto &Stmt : S)
  for (auto &Acc : Stmt.accesses()) {
    if (isl_map_is_equal(Acc.getAccessRelation(), RequiredAcc))
      remapAccesses(Acc.getScopArrayInfo(), Transformation);
  }

I think that it's better to avoid the redundant iteration over memory accesses and use the following function:

void remapAccesses(__isl_keep isl_map *RequiredAcc,
                   __isl_take isl_map *NewAccess) {
  for (auto &Stmt : S)
    for (auto &Acc : Stmt.accesses()) {
      isl_map *AccessRelation = Acc.getAccessRelation();
      if (isl_map_is_equal(AccessRelation, RequiredAcc))
        Acc.setAccessRelation(isl_map_copy(NewAccess));
      isl_map_free(AccessRelation);
    }
  isl_map_free(NewAccess);
}

We could probably also replace only the accesses of a specified SCoP statement, because the replacement of accesses of all statements can then be implemented using, for example, the following loop:

for (auto &Stmt : S)
  remapAccesses(Stmt, RequiredAcc, NewAccess);

The described function is implemented in replaceMemAccRelations.

gareevroman updated this revision to Diff 67149.Aug 8 2016, 5:58 AM
gareevroman retitled this revision from [GSoC 2016] [Polly] [WIP] Perform replacement of access relations according to the packing transformation to [GSoC 2016] [Polly] [WIP] Perform replacement of access relations and creation of new arrays according to the packing transformation.
gareevroman updated this object.

In this version, the issue related to the wrong access function of B-type matrices is fixed and redundant functions are removed.

Hi Roman,

sorry for the delay. I am happy with the general direction of the patch, but have a couple of comments, some of them suggesting a slightly different structure in the patch.

Best,
Tobias

lib/Transform/ScheduleOptimizer.cpp
647 ↗(On Diff #67149)

As written earlier, I am a little uncomfortable with an interface that uses an old access relation to identify the memory accesses to rewrite and immediately rewrites it.

Possibly not very different, but for me it would be easier to understand if this were split into two steps:

  1. Identify the MemoryAccess that we will rewrite.

MemoryAccess *identifyAccess()

  2. Then just set the access relation for this memory access. This could even be part of optimizeMatMulPattern().

AccessA->setAccessRelation()

To 1) you could add a description that explains how you identify the memory accesses.

658 ↗(On Diff #67149)

I am surprised you create this access relation from scratch. I would expect that you identify the interesting access in the statement and then just get the original access relation with Access::getAccessRelation(). Or do you use this access relation to identify the memory access? If that is the case, I would expect this to be called from identifyAccess().

667 ↗(On Diff #67149)

that is specific TO matrix multiplication

673 ↗(On Diff #67149)

Can you expand the documentation here and explain both the relation from the original space to the
data space _and_ the relation from the scattering space to the data space.

Can you also give more descriptive names to K, I, and J and explain what they are used for (can be brief).

711 ↗(On Diff #67149)

The name setScopInfo() seems misleading. Also, I am surprised that the creation of a new ScopArrayInfo is hidden here. I would expect a function

ScopArrayInfo *createScopArray(
                                const std::vector<unsigned> &DimSizes,
                                const char *Name)

which just creates the array. Later code then takes the ScopArrayInfo and sets the isl_map's tuple when the isl_map is created/updated.

Also, should you not derive the type of the array from the type of the memory accesses instead of always setting it to double?

724 ↗(On Diff #67149)

MicKer sounds very cryptic, can we use MicroParams and MacroParams?

768 ↗(On Diff #67149)

Why does this need to be specific to the BLIS kernel? If you want to bail out, I would suggest to directly do this in optimizeMatMulPattern and make getInductionVariablesSubstitution just about extracting the schedule map.

test/Isl/CodeGen/MemAccess/mat_mul_pattern_data_layout.ll
1 ↗(On Diff #67149)

Instead of testing that we generate the right IR, I wonder if it makes more sense to check this at the polyhedral level and put it into test/ScheduleOptimizer?

Hi Tobias,

thanks for the comments!

lib/Transform/ScheduleOptimizer.cpp
647 ↗(On Diff #67149)

Sorry, I'm afraid I don't fully understand the interface.

Can we pass some parameter to identifyAccess to specify the memory access that we want to get?

As far as I understand, we should be able to distinguish, for example, the following access relations to perform the packing transformation, because packing them requires different transformations:

S[i, j, k]->M1[i, k], S[i, j, k]->M2[k, j]

Every SCoP that corresponds to a matrix multiplication contains both such accesses. Which access relation is returned by "identifyAccess"?

If I'm not mistaken, you wrote that it's more intuitive to remap all accesses to the array to a new array using, for example, the following function:

void remapAccesses(ScopArrayInfo *ArrayToTransform,
                   __isl_take isl_map *Transformation) {
  for (auto &Stmt : S)
    for (auto &Acc : Stmt.accesses())
      if (Acc.getScopArrayInfo() == ArrayToTransform) {
        isl_map *AccessRelation = Acc.getAccessRelation();
        AccessRelation =
            isl_map_apply_range(AccessRelation, isl_map_copy(Transformation));
        Acc.setAccessRelation(AccessRelation);
      }
  isl_map_free(Transformation);
}

I tried to explain why I didn't use this approach in the following comment: https://reviews.llvm.org/D22187#502358

711 ↗(On Diff #67149)

The name setScopInfo() seems misleading.

I'll try to come up with something more descriptive.

Also, I am surprised that the creation of a new ScopArrayInfo is hidden here. I would expect a function

...

which just create the array. Later code then takes the ScopArrayInfo and sets the isl_map_tuple when the isl_map is created/updated.

We could probably use Scop::createScopArrayInfo(Type *ElementType, const std::string &BaseName, const std::vector<unsigned> &Sizes) for this purpose.

I've just tried to factor out repeated code into setScopInfo.

Also, should you not derive the type of the array from the type of the memory accesses instead of always set it to double?

OK. Let's derive the types of the arrays in this patch. I thought that we could add this in the future.

724 ↗(On Diff #67149)

Yes, I think that it would be better.

768 ↗(On Diff #67149)

OK.

test/Isl/CodeGen/MemAccess/mat_mul_pattern_data_layout.ll
1 ↗(On Diff #67149)

We could probably check that we generate the right ScopInfo. What do you think about it?

Hi Roman,

some quick answers.

Best,
Tobias

lib/Transform/ScheduleOptimizer.cpp
647 ↗(On Diff #67149)

Can we pass some parameter to identifyAccess to specify the memory access that we want to get?

Yes. If you want to make it simple and similar to the code you have today, you could create the following functions:

Access *identifyAccessOfShape(ScopStmt *Stmt,
                              __isl_take isl_map *ExpectedAccessRelation) {
  for (auto *MemA : *Stmt) {
    auto *AccessRelation = MemA->getAccessRelation();
    bool Matches = isl_map_is_equal(ExpectedAccessRelation, AccessRelation);
    isl_map_free(AccessRelation);
    if (Matches) {
      isl_map_free(ExpectedAccessRelation);
      return MemA;
    }
  }
  isl_map_free(ExpectedAccessRelation);
  return nullptr;
}

Access *identifyAccessA(ScopStmt *Stmt) {
  auto *Relation = getMatMulPatternOriginalAccessRelation(Stmt, ?, ?);
  return identifyAccessOfShape(Stmt, Relation);
}
Access *identifyAccessB(ScopStmt *Stmt) {
  auto *Relation = getMatMulPatternOriginalAccessRelation(Stmt, ?, ?);
  return identifyAccessOfShape(Stmt, Relation);
}

This is not very different, but I believe it will make the code easier to understand and will also allow you to get the types directly from the returned arrays.

711 ↗(On Diff #67149)

We could derive types later, but my feeling is that if we get the interface right, this should really just be a one-line change.

test/Isl/CodeGen/MemAccess/mat_mul_pattern_data_layout.ll
1 ↗(On Diff #67149)

Yes, good idea.

A new version of the patch updated according to the comments.

gareevroman added inline comments.Aug 13 2016, 4:16 AM
test/Isl/CodeGen/MemAccess/mat_mul_pattern_data_layout.ll
2 ↗(On Diff #67960)

Sorry, I've realized that I can do it only manually. Could you please advise me how to dump a ScopInfo optimized using IslScheduleOptimizer? Could we implement it, if it is not available at the moment?

Hi Roman,

the code structure now looks a lot better. I have a last round of comments on the documentation of your code. Most of them are minor improvements and suggestions to improve the wording, which should be easy to apply. I would like to have a last look, but I assume that after the next iteration this is good to go.

lib/Transform/ScheduleOptimizer.cpp
622 ↗(On Diff #67960)

@brief Identify a memory access through the shape of its memory access relation.

625 ↗(On Diff #67960)

Identify the unique memory access in @p Stmt, that has an access relation equal to @p ExpectedAccessRelation.

630 ↗(On Diff #67960)

"replaced" does not make sense here. We just find the memory access, but do not know how it will be used later. Maybe just:

"The access relation that identifies the memory access to identify,

633 ↗(On Diff #67960)

nullptr in case there is no or more than one such access.

635 ↗(On Diff #67960)

identifyAccessByAccessRelation

640 ↗(On Diff #67960)

Call it 'Access' instead of 'MemA'. 'A' is too short to be easily understandable.

652 ↗(On Diff #67960)

Please also return nullptr in case more than one access matches the access relation. Just returning the "first" match seems rather random.

655 ↗(On Diff #67960)

specific to THE matrix multiplication pattern. <-- end with "." here.

662 ↗(On Diff #67960)

@param Stmt The SCoP statement for which to generate the access relation.

663 ↗(On Diff #67960)

@param I The index of the input dimension that is mapped to the first output dimension.
@param J The index of the input dimension that is mapped to the second output dimension.

676 ↗(On Diff #67960)

Identify the memory access ...

Finish sentence with "."

678 ↗(On Diff #67960)

Identify the memory access ...

681 ↗(On Diff #67960)

contains THE memory accesses

691 ↗(On Diff #67960)

Identify the memory access ...

693 ↗(On Diff #67960)

Identify the memory access that

696 ↗(On Diff #67960)

contains the memory accesses

706 ↗(On Diff #67960)

Finish sentence with "."

732 ↗(On Diff #67960)

The parameter names in the comments don't match the ones in the function definition.

749 ↗(On Diff #67960)

Finish with ".".

753 ↗(On Diff #67960)

array, and

783 ↗(On Diff #67960)

I would suggest identifying AccessA and AccessB first and bailing out if either of the two is null. Only if we have successfully identified both would I proceed with the transformation. This never leaves us in a state where we transformed one, but not the other.

794 ↗(On Diff #67960)

Finish with "."

test/Isl/CodeGen/MemAccess/mat_mul_pattern_data_layout.ll
2 ↗(On Diff #67960)

You can leave this test for now. We can improve this in the next iteration.

Hi Tobias,

thanks for the comments! I've tried to address them in this version of the patch.

grosser accepted this revision.Aug 15 2016, 4:30 AM
grosser edited edge metadata.

LGTM

lib/Transform/ScheduleOptimizer.cpp
627 ↗(On Diff #68008)

THE memory access

630 ↗(On Diff #68008)

Drop 'to identify'.

This revision is now accepted and ready to land.Aug 15 2016, 4:30 AM
This revision was automatically updated to reflect the committed changes.