
[GSoC 2016] [Polly] [WIP] Perform replacement of access relations and creation of new arrays according to the packing transformation
ClosedPublic

Authored by gareevroman on Jul 9 2016, 6:58 AM.

Details

Summary

This is the third patch to apply the BLIS matmul optimization pattern on matmul kernels (http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf). BLIS implements gemm as three nested loops around a macro-kernel, plus two packing routines. The macro-kernel is implemented in terms of two additional loops around a micro-kernel. The micro-kernel is a loop around a rank-1 (i.e., outer product) update. In this change we replace the access relations and create empty arrays, which are steps towards implementing the packing transformation. In subsequent changes we will implement copying to the created arrays.
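To make the structure concrete, here is a minimal C++ sketch of the micro-kernel as a loop around rank-1 (outer product) updates. The tile sizes Mr, Nr, Kc, the function name, and the layout of Ac/Bc are illustrative placeholders, not the values the patch computes:

```cpp
#include <cassert>

// Illustrative register-tile sizes; real values are target-dependent.
constexpr int Mr = 2, Nr = 2, Kc = 4;

// Micro-kernel: each iteration of pr performs one rank-1 (outer product)
// update of the Mr x Nr tile of C, as in the BLIS design. Ac and Bc hold
// packed micro-panels, stored so that column pr of each is contiguous.
void microKernel(const double (&Ac)[Kc][Mr], const double (&Bc)[Kc][Nr],
                 double (&C)[Mr][Nr]) {
  for (int pr = 0; pr < Kc; ++pr)     // loop around the rank-1 update
    for (int ir = 0; ir < Mr; ++ir)
      for (int jr = 0; jr < Nr; ++jr)
        C[ir][jr] += Ac[pr][ir] * Bc[pr][jr];
}
```

In the full BLIS structure, this kernel sits inside two loops forming the macro-kernel, which in turn sits inside the three outermost blocking loops.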

Diff Detail

Repository
rL LLVM

Event Timeline


This is a draft of the third patch to apply the BLIS matmul optimization pattern on matmul kernels. It should probably be split to separate the modification of the existing code (creation of the macro-kernel and determination of its parameters, http://reviews.llvm.org/D21491) from the replacement of access relations.

gareevroman updated this revision to Diff 63393. Jul 9 2016, 8:50 AM

Fix incomplete comments.

Fix wrong comments and names of variables.

grosser edited edge metadata. Jul 11 2016, 6:58 AM

Hi Roman,

thanks for your work. Even though this is still WIP, here are already some first comments:

  1. A test case is needed and will make reviewing this patch easier.
  2. I am not sure I like the interface to replace memory accesses. It seems this interface takes a specific memory access relation and replaces all accesses that have this specific relation with the new access relation. I understand that this might work, but wonder if another interface might not be more intuitive. One could, e.g., provide a relation

A[i][j] -> New[x][y][z], which is applied to all accesses to remap all accesses to A to a new array. Would such an interface work for your transformation?

Best,
Tobias

Remaining comments are inline.

Best,
Tobias

include/polly/ScheduleOptimizer.h
247 ↗(On Diff #63480)

Please either keep explicit parameters or use a struct with named fields to pass parameters, as otherwise their function is not clear.

lib/Transform/ScheduleOptimizer.cpp
543 ↗(On Diff #63480)

Nice. I like small functions!

551 ↗(On Diff #63480)

Drop " It should be noted that". It does not carry information.

559 ↗(On Diff #63480)

I would use Stmt instead of ScpStmt.

614 ↗(On Diff #63393)

It seems sufficient to reference this paper once, rather than at so many places.

Besides referencing the paper, it would be good to describe what these functions do. E.g what parameters they compute and what reasoning is behind this parameter computation. In the end, the comments and code should be self contained, such that people do not need to read the paper to get a first understanding of the code.

MicroKernelParams: I am not a big fan of using just a list of int as parameter vector. I suggest either giving the list and _naming_ them explicitly, or defining a struct with named fields where each field is described.
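A struct with named fields, as suggested, could look like the following sketch. The type names, field names, and placeholder values here are hypothetical illustrations of the suggestion, not the patch's actual interface:

```cpp
#include <cassert>

// Hypothetical structs with named fields instead of an int vector; the
// field meanings follow the BLIS analytical model (Mr x Nr register tile;
// Mc, Nc, Kc cache-level block sizes).
struct MicroKernelParamsTy {
  int Mr; // rows of the register tile updated by the micro-kernel
  int Nr; // columns of the register tile
};

struct MacroKernelParamsTy {
  int Mc; // block size along the m dimension
  int Nc; // block size along the n dimension
  int Kc; // block size along the k dimension
};

// Returning the structure by value keeps the interface self-documenting.
MicroKernelParamsTy getMicroKernelParams() {
  return {4, 8}; // placeholder values for illustration only
}
```

Callers then refer to `Params.Mr` instead of an anonymous `Params[0]`, which is the readability gain the comment asks for.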

676 ↗(On Diff #63393)

I would use 'Stmt' instead of ScpStmt.

700 ↗(On Diff #63393)

Using strings to get a first prototype is fine, but in the code that we commit I would prefer to build up these maps not by parsing strings, but by constructing these functions from isl. A common approach is to build an isl_aff first and then translate it to an isl_map.

711 ↗(On Diff #63393)

Pulling getMicroKernelParams and getMacroKernelParams into their own functions is a good idea. It might be better to return the resulting structure as a return value, though.

Hi Roman,

thanks for your work. Even though this is still WIP, here are already some first comments:

Hi Tobias,

thanks for the comments!

  1. A test case is needed and will make reviewing this patch easier.

Maybe it should check that, in the case of matrix multiplication, Polly generates code similar to the following:

%arg3.s2a.reload228 = load double, double* %arg3.s2a
%polly.access.cast.arg6229 = bitcast [1024 x double]* %arg6 to double*
%181 = mul nsw i64 1024, %polly.indvar95
%182 = mul nsw i64 4, %polly.indvar107
%183 = add nsw i64 %181, %182
%184 = add nsw i64 %183, 1
%polly.access.arg6230 = getelementptr double, double* %polly.access.cast.arg6229, i64 %184
%tmp17_p_scalar_231 = load double, double* %polly.access.arg6230, align 8, !alias.scope !4, !noalias !7
%p_tmp18232 = fmul double %tmp17_p_scalar_231, %arg3.s2a.reload228
%polly.access.cast.arg7233 = bitcast [1056 x double]* %arg7 to double*
%185 = mul nsw i64 2048, %polly.indvar95
%186 = mul nsw i64 8, %polly.indvar107
%187 = add nsw i64 %185, %186
%188 = add nsw i64 %187, 1

What do you think about it?

  2. I am not sure I like the interface to replace memory accesses. It seems this interface takes a specific memory access relation and replaces all accesses that have this specific relation with the new access relation. I understand that this might work, but wonder if another interface might not be more intuitive. One could, e.g., provide a relation

A[i][j] -> New[x][y][z], which is applied to all accesses to remap all accesses to A to a new array. Would such an interface work for your transformation?

Yes, I think that an ability to remap specific memory accesses to A is unnecessary for the transformation, and such an interface would work fine for it.

  2. I am not sure I like the interface to replace memory accesses. It seems this interface takes a specific memory access relation and replaces all accesses that have this specific relation with the new access relation. I understand that this might work, but wonder if another interface might not be more intuitive. One could, e.g., provide a relation

A[i][j] -> New[x][y][z], which is applied to all accesses to remap all accesses to A to a new array. Would such an interface work for your transformation?

Yes, I think that an ability to remap specific memory accesses to A is unnecessary for the transformation, and such an interface would work fine for it.

Sorry for the misprint. I think that it is possible to remap specific memory accesses to A using such an interface.

Meinersbur added inline comments. Jul 14 2016, 7:05 AM
lib/Transform/ScheduleOptimizer.cpp
547 ↗(On Diff #63480)

If with the ScpStmt you refer to the parameter with that name, use @p ScpStmt.

Usually we use Stmt as variable name for a generic ScopStmt.

548–549 ↗(On Diff #63480)

a form of NewMemAccessRel

What form?

562 ↗(On Diff #63480)

I find the style "The map, which ..." odd.

Suggestion:

A map describing/that describes/containing the new access relations.
567 ↗(On Diff #63480)

Use C++11 range loops:
for (auto *MemA : *ScpStmt)
this will make (*MemA)-> unnecessary.

568 ↗(On Diff #63480)

auto *AccessRelation = to limit the variable's scope to inside the loop body; it's not used outside of it.

673 ↗(On Diff #63480)

utilized -> implemented. Is it possible to have a summary of the algorithm here?

681–682 ↗(On Diff #63480)

Suggestion:

/// @param MicroKernelParams,MacroKernelParams Parameters of the BLIS micro kernel.

Hi Roman,

thanks for your work. Even though this is still WIP, here are already some first comments:

Hi Tobias,

thanks for the comments!

  1. A test case is needed and will make reviewing this patch easier.

Maybe it should check that, in the case of matrix multiplication, Polly generates code similar to the following:

%arg3.s2a.reload228 = load double, double* %arg3.s2a
%polly.access.cast.arg6229 = bitcast [1024 x double]* %arg6 to double*
%181 = mul nsw i64 1024, %polly.indvar95
%182 = mul nsw i64 4, %polly.indvar107
%183 = add nsw i64 %181, %182
%184 = add nsw i64 %183, 1
%polly.access.arg6230 = getelementptr double, double* %polly.access.cast.arg6229, i64 %184
%tmp17_p_scalar_231 = load double, double* %polly.access.arg6230, align 8, !alias.scope !4, !noalias !7
%p_tmp18232 = fmul double %tmp17_p_scalar_231, %arg3.s2a.reload228
%polly.access.cast.arg7233 = bitcast [1056 x double]* %arg7 to double*
%185 = mul nsw i64 2048, %polly.indvar95
%186 = mul nsw i64 8, %polly.indvar107
%187 = add nsw i64 %185, %186
%188 = add nsw i64 %187, 1

What do you think about it?

Either this, or you dump the changed access functions in ScopInfo.

Best,
Tobias

  2. I am not sure I like the interface to replace memory accesses. It seems this interface takes a specific memory access relation and replaces all accesses that have this specific relation with the new access relation. I understand that this might work, but wonder if another interface might not be more intuitive. One could, e.g., provide a relation

A[i][j] -> New[x][y][z], which is applied to all accesses to remap all accesses to A to a new array. Would such an interface work for your transformation?

Hi Tobias,

I'm afraid that the interface probably doesn't work for the transformation, because, as far as I understand, we need additional information about the access relation of A to distinguish, for example, A[i][j] and A[i][k]. It seems that there is no way to pass such information using such an interface. Sorry, I should have recognized this earlier.

Maybe we could add an overloaded version that accepts an access relation of A, or make this argument optional.

What do you think about it?

Hi Michael,

thanks for the comments! I'll try to take them into account.

Hi Roman,

I may not yet fully understand this issue, but don't you transform all accesses to the matrix? Assuming you don't transpose a submatrix but the full matrix A, applying the mapping A[x0][x1] -> A2[x1][x0] to all accesses would be what you want to do. The same should hold when you are working with submatrices, except that the mapping function is slightly more complicated.

Hi Roman,

I may not yet fully understand this issue, but don't you transform all accesses to the matrix? Assuming you don't transpose a submatrix but the full matrix A, applying the mapping A[x0][x1] -> A2[x1][x0] to all accesses would be what you want to do. The same should hold when you are working with submatrices, except that the mapping function is slightly more complicated.

Yes, I think this is true.

I guess I figured it out. Thanks.

It is not possible that, for example, the A[i][j] and A[i][k] of the access relations S1[i, j, k] -> A[i][j] and S2[i, j, k] -> A[i][k], which may correspond to the following code, have the same identifier. Consequently, we'll be able to remap only one of these two accesses if we apply A[i][j] -> New[x][y][z] using the interface. Is that right?

for i = LB1, …, UB1 in steps of 1
  for j = LB2, …, UB2 in steps of 1
    …  = A[i][j] : S1
  endfor
  for k = LB3, …, UB3 in steps of 1
    …  = A[i][k] : S2
  endfor
endfor

I've planned to implement it by iterating over all memory accesses and remapping only those ranges of access relations that live in the same space as the domain of the provided relation. The interface probably works fine if we optimize matrix multiplication and should remap only, for example, S[i, j, k] -> A[i, k] and S[i, j, k] -> B[k, j]. In our case, however, there is no limit on the number of access relations that should be remapped. We'll have to iterate over all memory accesses for every access relation that we want to transform. Maybe I don't understand something.

Hi Roman,

I may not yet fully understand this issue, but don't you transform all accesses to the matrix? Assuming you don't transpose a submatrix but the full matrix A, applying the mapping A[x0][x1] -> A2[x1][x0] to all accesses would be what you want to do. The same should hold when you are working with submatrices, except that the mapping function is slightly more complicated.

Yes, I think this is true.

I guess I figured it out. Thanks.

It is not possible that, for example, the A[i][j] and A[i][k] of the access relations S1[i, j, k] -> A[i][j] and S2[i, j, k] -> A[i][k], which may correspond to the following code, have the same identifier. Consequently, we'll be able to remap only one of these two accesses if we apply A[i][j] -> New[x][y][z] using the interface. Is that right?

Not sure what you mean by the same identifier. It still seems you want to apply the mapping only to a subset of the memory references by somehow identifying the ones on which you want to apply the mapping through the identifiers in their access function. Is this true?

for i = LB1, …, UB1 in steps of 1
  for j = LB2, …, UB2 in steps of 1
    …  = A[i][j] : S1
  endfor
  for k = LB3, …, UB3 in steps of 1
    …  = A[i][k] : S2
  endfor
endfor

I've planned to implement it by iterating over all memory accesses and remapping only those ranges of access relations that live in the same space as the domain of the provided relation. The interface probably works fine if we optimize matrix multiplication and should remap only, for example, S[i, j, k] -> A[i, k] and S[i, j, k] -> B[k, j]. In our case, however, there is no limit on the number of access relations that should be remapped. We'll have to iterate over all memory accesses for every access relation that we want to transform. Maybe I don't understand something.

Not sure I understand this. Could you maybe post the ScopInfo before your mapping and the ScopInfo that you expect as a result? In fact, maybe you can even implement this transformation with the jscop interface (it might need some enhancements to support the introduction of new array accesses).

Best,
Tobi

Hi Roman,

I may not yet fully understand this issue, but don't you transform all accesses to the matrix? Assuming you don't transpose a submatrix but the full matrix A, applying the mapping A[x0][x1] -> A2[x1][x0] to all accesses would be what you want to do. The same should hold when you are working with submatrices, except that the mapping function is slightly more complicated.

Yes, I think this is true.

I guess I figured it out. Thanks.

It is not possible that, for example, the A[i][j] and A[i][k] of the access relations S1[i, j, k] -> A[i][j] and S2[i, j, k] -> A[i][k], which may correspond to the following code, have the same identifier. Consequently, we'll be able to remap only one of these two accesses if we apply A[i][j] -> New[x][y][z] using the interface. Is that right?

Not sure what you mean by the same identifier. It still seems you want to apply the mapping only to a subset of the memory references by somehow identifying the ones on which you want to apply the mapping through the identifiers in their access function. Is this true?

I didn't want to apply the mapping only to a subset of the memory references. I just wanted to understand what we get if we apply the transformation to the example.

OK. Now I've realized that the example is incorrect, because it produces two SCoP statements and we apply the transformation to memory references of only one SCoP statement. Thanks. I think that I don't have any more questions related to it.

for i = LB1, …, UB1 in steps of 1
  for j = LB2, …, UB2 in steps of 1
    …  = A[i][j] : S1
  endfor
  for k = LB3, …, UB3 in steps of 1
    …  = A[i][k] : S2
  endfor
endfor

I've planned to implement it by iterating over all memory accesses and remapping only those ranges of access relations that live in the same space as the domain of the provided relation. The interface probably works fine if we optimize matrix multiplication and should remap only, for example, S[i, j, k] -> A[i, k] and S[i, j, k] -> B[k, j]. In our case, however, there is no limit on the number of access relations that should be remapped. We'll have to iterate over all memory accesses for every access relation that we want to transform. Maybe I don't understand something.

Not sure I understand this. Could you maybe post the ScopInfo before your mapping and the ScopInfo that you expect as a result? In fact, maybe you can even implement this transformation with the jscop interface (it might need some enhancements to support the introduction of new array accesses).

This is the ScopInfo before mapping. It corresponds to a general matrix multiplication. I think that we aren't interested in some specific remapping, so the expected result is probably not important.

Function: kernel_gemm
Region: %10---%35
Max Loop Depth:  3
Invariant Accesses: {
}
Context:
{  :  }
Assumed Context:
{  :  }
Invalid Context:
{  : 1 = 0 }
Arrays {
    double MemRef_5[*][1056]; // Element size 8
    double MemRef_4; // Element size 8
    double MemRef_6[*][1024]; // Element size 8
    double MemRef_3; // Element size 8
    double MemRef_7[*][1056]; // Element size 8
}
Arrays (Bounds as pw_affs) {
    double MemRef_5[*][ { [] -> [(1056)] } ]; // Element size 8
    double MemRef_4; // Element size 8
    double MemRef_6[*][ { [] -> [(1024)] } ]; // Element size 8
    double MemRef_3; // Element size 8
    double MemRef_7[*][ { [] -> [(1056)] } ]; // Element size 8
}
Alias Groups (2):
    [[ <{ MemRef_7[(0), (0)] }, { MemRef_7[(1023), (1056)] }> <{ MemRef_5[(0), (0)] }, { MemRef_5[(1055), (1056)] }> ]]
    [[ <{ MemRef_6[(0), (0)] }, { MemRef_6[(1055), (1024)] }> <{ MemRef_5[(0), (0)] }, { MemRef_5[(1055), (1056)] }> ]]
Statements {
Stmt_12
        Domain :=
            { Stmt_12[i0, i1] : 0 <= i0 <= 1055 and 0 <= i1 <= 1055 };
        Schedule :=
            { Stmt_12[i0, i1] -> [i0, i1, 0, 0] };
        ReadAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_12[i0, i1] -> MemRef_5[i0, i1] };
        ReadAccess := [Reduction Type: NONE] [Scalar: 1]
            { Stmt_12[i0, i1] -> MemRef_4[] };
        MustWriteAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_12[i0, i1] -> MemRef_5[i0, i1] };
Stmt_17
        Domain :=
            { Stmt_17[i0, i1, i2] : 0 <= i0 <= 1055 and 0 <= i1 <= 1055 and 0 <= i2 <= 1023 };
        Schedule :=
            { Stmt_17[i0, i1, i2] -> [i0, i1, 1, i2] };
        ReadAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_17[i0, i1, i2] -> MemRef_6[i0, i2] };
        ReadAccess := [Reduction Type: NONE] [Scalar: 1]
            { Stmt_17[i0, i1, i2] -> MemRef_3[] };
        ReadAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_17[i0, i1, i2] -> MemRef_7[i2, i1] };
        ReadAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_17[i0, i1, i2] -> MemRef_5[i0, i1] };
        MustWriteAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_17[i0, i1, i2] -> MemRef_5[i0, i1] };
}

If I'm not mistaken, we want to remap:

ReadAccess := [Reduction Type: NONE] [Scalar: 0]
  { Stmt_17[i0, i1, i2] -> MemRef_6[i0, i2] };

ReadAccess := [Reduction Type: NONE] [Scalar: 0]
  { Stmt_17[i0, i1, i2] -> MemRef_7[i2, i1] };

However, if we modify gemm,

for (i = 0; i < _PB_NI; i++)
  for (j = 0; j < _PB_NJ; j++)
    {
      C[i][j] *= beta;
      for (k = 0; k < _PB_NK; ++k)
        C[i][j] += alpha * A[i][k] * B[k][j] * B1[k][j];
    }

we'll remap the accesses to A[i][k], B[k][j] and also B1[k][j] (Stmt_17[i0, i1, i2] -> MemRef_6[i0, i2], Stmt_17[i0, i1, i2] -> MemRef_7[i2, i1], Stmt_18[i0, i1, i2] -> MemRef_8[i2, i1]), because such a statement also passes the checks. Consequently, there is no limit on the number of access relations that should be remapped.

As far as I understand, to remap accesses to a specific array A we should iterate over all memory accesses of a statement and remap only those ranges of access relations that live in the same space as the domain of the provided relation. However, since there is no limit on the number of arrays and their accesses that should be remapped, it can be costly to remap them all.

[...]

Function: kernel_gemm
Region: %10---%35
Max Loop Depth:  3
Invariant Accesses: {
}
Context:
{  :  }
Assumed Context:
{  :  }
Invalid Context:
{  : 1 = 0 }
Arrays {
    double MemRef_5[*][1056]; // Element size 8
    double MemRef_4; // Element size 8
    double MemRef_6[*][1024]; // Element size 8
    double MemRef_3; // Element size 8
    double MemRef_7[*][1056]; // Element size 8
}
Arrays (Bounds as pw_affs) {
    double MemRef_5[*][ { [] -> [(1056)] } ]; // Element size 8
    double MemRef_4; // Element size 8
    double MemRef_6[*][ { [] -> [(1024)] } ]; // Element size 8
    double MemRef_3; // Element size 8
    double MemRef_7[*][ { [] -> [(1056)] } ]; // Element size 8
}
Alias Groups (2):
    [[ <{ MemRef_7[(0), (0)] }, { MemRef_7[(1023), (1056)] }> <{ MemRef_5[(0), (0)] }, { MemRef_5[(1055), (1056)] }> ]]
    [[ <{ MemRef_6[(0), (0)] }, { MemRef_6[(1055), (1024)] }> <{ MemRef_5[(0), (0)] }, { MemRef_5[(1055), (1056)] }> ]]
Statements {
Stmt_12
        Domain :=
            { Stmt_12[i0, i1] : 0 <= i0 <= 1055 and 0 <= i1 <= 1055 };
        Schedule :=
            { Stmt_12[i0, i1] -> [i0, i1, 0, 0] };
        ReadAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_12[i0, i1] -> MemRef_5[i0, i1] };
        ReadAccess := [Reduction Type: NONE] [Scalar: 1]
            { Stmt_12[i0, i1] -> MemRef_4[] };
        MustWriteAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_12[i0, i1] -> MemRef_5[i0, i1] };
Stmt_17
        Domain :=
            { Stmt_17[i0, i1, i2] : 0 <= i0 <= 1055 and 0 <= i1 <= 1055 and 0 <= i2 <= 1023 };
        Schedule :=
            { Stmt_17[i0, i1, i2] -> [i0, i1, 1, i2] };
        ReadAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_17[i0, i1, i2] -> MemRef_6[i0, i2] };
        ReadAccess := [Reduction Type: NONE] [Scalar: 1]
            { Stmt_17[i0, i1, i2] -> MemRef_3[] };
        ReadAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_17[i0, i1, i2] -> MemRef_7[i2, i1] };
        ReadAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_17[i0, i1, i2] -> MemRef_5[i0, i1] };
        MustWriteAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_17[i0, i1, i2] -> MemRef_5[i0, i1] };
}

If I'm not mistaken, we want to remap:

ReadAccess := [Reduction Type: NONE] [Scalar: 0]
  { Stmt_17[i0, i1, i2] -> MemRef_6[i0, i2] };

ReadAccess := [Reduction Type: NONE] [Scalar: 0]
  { Stmt_17[i0, i1, i2] -> MemRef_7[i2, i1] };

However, if we modify gemm,

for (i = 0; i < _PB_NI; i++)
  for (j = 0; j < _PB_NJ; j++)
    {
      C[i][j] *= beta;
      for (k = 0; k < _PB_NK; ++k)
        C[i][j] += alpha * A[i][k] * B[k][j] * B1[k][j];
    }

we'll remap the accesses to A[i][k], B[k][j] and also B1[k][j] (Stmt_17[i0, i1, i2] -> MemRef_6[i0, i2], Stmt_17[i0, i1, i2] -> MemRef_7[i2, i1], Stmt_18[i0, i1, i2] -> MemRef_8[i2, i1]), because such a statement also passes the checks. Consequently, there is no limit on the number of access relations that should be remapped.

I am confused. What is the matrix B1 about? I would expect the inner statement to just be:

C[i][j] += alpha * A[i][k] * B[k][j];

Also, AFAIU only one of the two matrices A and B needs to be transposed. Consequently, I would expect that only one of them has to be mapped. You seem to want to map two arrays - MemRef_6 and MemRef_7. Why is this necessary?

Also, it seems we still don't understand each other well. The text above seems to be mostly discussing that it is hard to find the right array references to transform, but I do not understand yet if you mean it is -- even for you as a human -- difficult to understand how and what you want to map _or_ if you mean applying the mapping the way you want is difficult. That's why I asked if you can send me the transformed ScopInfo as you would like to have it. Even though it might be trivial, it shows me _what_ you want to achieve. Having this information I can then more easily explain _how_ you can achieve this.

As far as I understand, to remap accesses to a specific array A we should iterate over all memory accesses of a statement and remap only those ranges of access relations that live in the same space as the domain of the provided relation.

That is what I think, yes.

However, since there is no limit on the number of arrays and their accesses that should be remapped, it can be costly to remap them all.

I don't think iterating once over all memory accesses is costly in compile time. I would worry about this only if we have profiling results that suggest this is indeed an issue. For me the code here should be pretty simple:

void remapAccesses(ScopArrayInfo *ArrayToTransform,
                   __isl_take isl_map *Transformation) {
  for (auto &Stmt : S)
    for (auto &Acc : Stmt.accesses())
      if (Acc.getScopArrayInfo() == ArrayToTransform) {
        isl_map *AccessRelation = Acc.getAccessRelation();
        AccessRelation =
            isl_map_apply_range(AccessRelation, isl_map_copy(Transformation));
        Acc.setAccessRelation(AccessRelation);
      }
  isl_map_free(Transformation);
}

Best,
Tobias
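The iterate-and-compose idea in the sketch above can also be modeled in plain C++, with access relations represented as functions from statement iterators to array subscripts. All names and types here are illustrative stand-ins, not Polly or isl API:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Toy model: an access relation maps statement iterators to array
// subscripts. These types stand in for isl maps and ScopArrayInfo.
using Subscripts = std::vector<long>;
using Relation = std::function<Subscripts(const Subscripts &)>;

struct AccessModel {
  std::string ArrayId; // stands in for the ScopArrayInfo of the access
  Relation Rel;        // stands in for the access relation
};

// Analogue of remapAccesses: compose the transformation with the range of
// every access relation that refers to the array being transformed.
void remapAccesses(std::vector<AccessModel> &Accesses,
                   const std::string &ArrayToTransform,
                   const std::string &NewArrayId,
                   const Relation &Transformation) {
  for (auto &Acc : Accesses)
    if (Acc.ArrayId == ArrayToTransform) {
      Relation Old = Acc.Rel;
      Acc.Rel = [Old, Transformation](const Subscripts &Iters) {
        return Transformation(Old(Iters)); // apply to the old range
      };
      Acc.ArrayId = NewArrayId;
    }
}
```

For example, with one access S[i, j, k] -> A[i, k] and the transformation A[x0][x1] -> A2[x1][x0], the remapped access becomes S[i, j, k] -> A2[k, i], while accesses to other arrays are untouched.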

Also, AFAIU only one of the two matrices A and B needs to be transposed. Consequently, I would expect that only one of them has to be mapped. You seem to want to map two arrays - MemRef_6 and MemRef_7. Why is this necessary?

My understanding is that it is necessary, since we want to perform the packing transformation, which is needed, for example, to determine optimal values of the Nc, Mc and Kc parameters from [1].

Let's consider the algorithm described in [1]. It operates on matrices stored in column-major order. (In our case, matrices are stored in row-major order; I'll describe the differences that can occur.)

We consider the matrix-matrix multiplication that has the following form C := AB + C, where the sizes of A, B, C are m x k, k x n, and m x n, respectively.

According to [1], we want to optimize it and get the following code ([1], p. 6)

Loop 1 for jc = 0, ..., n − 1 in steps of Nc
Loop 2   for pc = 0, ...,  k − 1 in steps of Kc
           B(pc : pc + Kc − 1, jc : jc + Nc − 1) → Bc // Pack into Bc
Loop 3     for ic = 0, ..., m − 1 in steps of Mc
             A(ic : ic + Mc − 1, pc : pc + Kc − 1) → Ac // Pack into Ac
Loop 4       for jr = 0, ..., Nc − 1 in steps of Nr
Loop 5         for ir = 0, ..., Mc − 1 in steps of Mr
Loop 6           for pr = 0, ..., Kc − 1 in steps of 1
                   Cc(ir : ir + Mr - 1, jr : jr + Nr - 1) += Ac(ir : ir + Mr - 1, pr) * Bc(pr, jr : jr + Nr - 1)
               endfor
             endfor
           endfor
         endfor
       endfor
     endfor

In case of row-major order:

Loop 1 for ic = 0, ..., m−1 in steps of mc
Loop 2   for pc = 0, ..., k−1 in steps of kc
           A(ic : ic + mc − 1, pc : pc + kc − 1) → Ac // Pack into Ac
Loop 3     for jc = 0, ..., n−1 in steps of nc
             B(pc : pc + kc − 1, jc : jc + nc − 1) → Bc // Pack into Bc
Loop 4       for ir = 0, ..., mc −1 in steps of mr
Loop 5         for jr = 0, ..., nc −1 in steps of nr
Loop 6           for pr = 0, ..., kc −1 in steps of 1
                   Cc(ir : ir + Mr - 1, jr : jr + Nr - 1) += Ac(ir : ir + Mr - 1, pr) * Bc(pr, jr : jr + Nr - 1)
                 endfor
               endfor
             endfor
           endfor
         endfor
       endfor

'B(pc : pc + Kc − 1, jc : jc + Nc − 1) → Bc' and 'A(ic : ic + Mc − 1, pc : pc + Kc − 1) → Ac' are the packings of matrices B and A, respectively. They reorganize the entries of B and A into the blocks Bc and Ac. The packing layout is illustrated in Figure 3 ([1], p. 6).

If I'm not mistaken, it is a data-layout transformation that requires introducing a new one-dimensional array, copying data to the array, and changing the memory access locations of the compute kernel to reference the array.

Let's assume that there is a loop 7, which iterates over a micro-panel of width nr, and a loop 8, which iterates over a micro-tile of length mr (they are unrolled in our implementations to get the BLIS micro-kernel). Consequently, the access functions of Ac and Bc will be [o1, o2, o3, o4, o5, o6, o7, o8] -> [Mr * (o6 + Kc * o5) + o8] and [o1, o2, o3, o4, o5, o6, o7, o8] -> [Nr * (o6 + Kc * o4) + o7], respectively.

In the case of row-major order they are [o0, o1, o2, o3, o4, o5, o6, o7, o8] -> [Mr * (o5 + Kc * o3) + o6] and [o0, o1, o2, o3, o4, o5, o6, o7, o8] -> [Nr * (o5 + Kc * o4) + o7].
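The row-major Ac access function above can be checked with a small packing sketch. The sizes Mr, Mc, Kc and the function name are illustrative only; it = micro-panel index, pr = column inside the panel, ir = row inside the micro-tile:

```cpp
#include <cassert>
#include <vector>

// Illustrative sizes: an Mc x Kc block of A is packed into Mc/Mr
// micro-panels of height Mr; inside a panel, each of the Kc columns is
// stored as Mr contiguous elements.
constexpr int Mr = 2, Mc = 4, Kc = 3;

// Pack the block A(0:Mc-1, 0:Kc-1) into a 1-D buffer Ac so that an
// element is found at index Mr * (pr + Kc * it) + ir, matching the
// row-major Ac access function discussed above.
std::vector<double> packAc(const std::vector<std::vector<double>> &A) {
  std::vector<double> Ac(Mc * Kc);
  for (int it = 0; it < Mc / Mr; ++it)   // micro-panel
    for (int pr = 0; pr < Kc; ++pr)      // column inside the panel
      for (int ir = 0; ir < Mr; ++ir)    // row inside the micro-tile
        Ac[Mr * (pr + Kc * it) + ir] = A[it * Mr + ir][pr];
  return Ac;
}
```

Within a micro-panel, the Mr elements of one column are contiguous, which is exactly what lets the micro-kernel read Ac with stride 1.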

The importance of packing is described in [1] (p. 6):

  1. We access consecutive memory locations, because the elements of Ac and Bc will be read sequentially. Accessing consecutive memory locations is usually faster.
  2. Data from Ac and Bc can be aligned. Accessing aligned data is also typically faster.
  3. It helps to preload data from Ac and Bc into certain cache levels to reduce the time to access their elements (to do this, in particular, we determine the parameters Kc, Mc and Nc ([1], p. 10)).

If I'm not mistaken, packing of A and B requires mapping the accesses to both the A and B arrays (in the previous case they are MemRef_6 and MemRef_7). Is that correct?

I am confused. What is the matrix B1 about? I would expect the inner statement to just be:

C[i][j] += alpha * A[i][k] * B[k][j];

C[i][j] += alpha * A[i][k] * B[k][j] * B1[k][j] also passes the checks. That's why B1[k][j] is an example of a matrix that can also appear in the inner statement.

In http://reviews.llvm.org/D20575 we decided to determine the following class of statements (this is a copy of its description):

" 1. there are exactly three input dimensions
2. all memory accesses of the statement will have stride 0 or 1, if we interchange loops (switch the variable used in the inner loop to the outer loop)
3. all memory accesses of the statement, except the last one, are read memory accesses, and the last one is a write memory access
4. all subscripts of the last memory access of the statement don't contain the variable used in the inner loop

An algorithm from [1] was designed to optimize a subclass of this class and could possibly cause a performance regression if we tried to apply it to all statements of the class. However, I think that determining the class requires simpler pattern matching than determining statements that contain only matrix multiplication. Furthermore, I think that we can use an algorithm similar to the one presented in [1] to optimize statements from the class in the future."

This class also contains 'C[i][j] += alpha * A[i][k] * B[k][j] * B1[k][j]'.

I've planned to remap all accesses of the form Matr1[i][k] and Matr2[k][j], where i, j, k are the induction variables of, for example, the following loop nest:

for i = LB1, …, UB1 in steps of 1
  for j = LB2, …, UB2 in steps of 1
    for k = LB3, …, UB3 in steps of 1
      …
    endfor
  endfor
endfor

I think that it shouldn't affect the optimization of general matrix-matrix multiplication.

Also, it seems we still don't understand each other well. The text above seems to be mostly discussing that it is hard to find the right array references to transform, but I do not understand yet if you mean it is -- even for you as a human -- difficult to understand how and what you want to map _or_ if you mean applying the mapping the way you want is difficult. That's why I asked if you can send me the transformed ScopInfo as you would like to have it. Even though it might be trivial, it shows me _what_ you want to achieve. Having this information I can then more easily explain _how_ you can achieve this.

I assumed that we were planning to remap all accesses of the form Matr1[i][k] and Matr2[k][j]. Sorry, I should have clarified this earlier.

Consequently, it was difficult for me to understand how to effectively remap accesses using the interface. It was also difficult to understand why we don't transpose submatrices.

Should we determine statements that contain only the two accesses A[i][k] and B[k][j]?

I think that we could remap all accesses of the form Matr1[i][k] and Matr2[k][j]. To reduce the cost of remapping, we could limit the number of such accesses that a statement from the class may have. You wrote that it's possible to work with submatrices. If the mapping function isn't too complicated, we could probably try to do it.

This is the transformed ScopInfo of general matrix-matrix multiplication, which I would like to have:

Function: kernel_gemm
Region: %10---%35
Max Loop Depth:  3
Invariant Accesses: {
}
Context:
{  :  }
Assumed Context:
{  :  }
Invalid Context:
{  : 1 = 0 }
Arrays {
    double MemRef_5[*][1056]; // Element size 8
    double MemRef_4; // Element size 8
    double MemRef_6[*][1024]; // Element size 8
    double MemRef_3; // Element size 8
    double MemRef_7[*][1056]; // Element size 8
}
Arrays (Bounds as pw_affs) {
    double MemRef_5[*][ { [] -> [(1056)] } ]; // Element size 8
    double MemRef_4; // Element size 8
    double MemRef_6[*][ { [] -> [(1024)] } ]; // Element size 8
    double MemRef_3; // Element size 8
    double MemRef_7[*][ { [] -> [(1056)] } ]; // Element size 8
}
Alias Groups (2):
    [[ <{ MemRef_7[(0), (0)] }, { MemRef_7[(1023), (1056)] }> <{ MemRef_5[(0), (0)] }, { MemRef_5[(1055), (1056)] }> ]]
    [[ <{ MemRef_6[(0), (0)] }, { MemRef_6[(1055), (1024)] }> <{ MemRef_5[(0), (0)] }, { MemRef_5[(1055), (1056)] }> ]]
Statements {
Stmt_12
        Domain :=
            { Stmt_12[i0, i1] : 0 <= i0 <= 1055 and 0 <= i1 <= 1055 };
        Schedule :=
            { Stmt_12[i0, i1] -> [0, o1, o2, i0 - 32o1, i1 - 32o2, 0, 0, 0, 0, 0, 0, 0, 0] : -31 + i0 <= 32o1 <= i0 and -31 + i1 <= 32o2 <= i1 };
        ReadAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_12[i0, i1] -> MemRef_5[i0, i1] };
        ReadAccess := [Reduction Type: NONE] [Scalar: 1]
            { Stmt_12[i0, i1] -> MemRef_4[] };
        MustWriteAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_12[i0, i1] -> MemRef_5[i0, i1] };
Stmt_17
        Domain :=
            { Stmt_17[i0, i1, i2] : 0 <= i0 <= 1055 and 0 <= i1 <= 1055 and 0 <= i2 <= 1023 };
        Schedule :=
            { Stmt_17[i0, i1, i2] -> [1, o1, o2, o3, o4, o5, i2 - 256o2, 0, 0, 0, i0 - 16o1 - 4o4, i1 - 96o3 - 8o5, 0] : -15 + i0 <= 16o1 <= i0 and -255 + i2 <= 256o2 <= i2 and -95 + i1 <= 96o3 <= i1 and -3 + i0 - 16o1 <= 4o4 <= i0 - 16o1 and -7 + i1 - 96o3 <= 8o5 <= i1 - 96o3 };
        ReadAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_17[i0, i1, i2] -> MemRef_6[i0, i2] };
       new: { Stmt_17[i0, i1, i2] -> MemRef_6[o0] : 4*floor((-i0 + o0)/4) = -i0 + o0 and -3 - 4i2 + o0 + 4096*floor((i0)/16) - 1024*floor((i0)/4) <= 1024*floor((255 - i2)/256) <= -4i2 + o0 + 4096*floor((i0)/16) - 1024*floor((i0)/4) };
        ReadAccess := [Reduction Type: NONE] [Scalar: 1]
            { Stmt_17[i0, i1, i2] -> MemRef_3[] };
        ReadAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_17[i0, i1, i2] -> MemRef_7[i2, i1] };
       new: { Stmt_17[i0, i1, i2] -> MemRef_7[o0] : 8*floor((-i1 + o0)/8) = -i1 + o0 and -7 - 8i2 + o0 + 8192*floor((i0)/16) - 2048*floor((i0)/4) <= 2048*floor((255 - i2)/256) <= -8i2 + o0 + 8192*floor((i0)/16) - 2048*floor((i0)/4) };
        ReadAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_17[i0, i1, i2] -> MemRef_5[i0, i1] };
        MustWriteAccess := [Reduction Type: NONE] [Scalar: 0]
            { Stmt_17[i0, i1, i2] -> MemRef_5[i0, i1] };
}

As far as I understand, to remap accesses to a specific array A we should iterate over all memory accesses of a statement and remap only those ranges of access relations that live in the same space as the domain of the provided relation.

That is what I think, yes.

However, since there is no limit on the number of arrays and their accesses that should be remapped, it can be costly to remap them all.

I don't think iterating once over all memory accesses is costly in compile time. I would worry about this only if we have profiling results that suggest this is indeed an issue. For me the code here should be pretty simple:

void remapAccesses(ScopArrayInfo *ArrayToTransform,
                   __isl_take isl_map *Transformation) {
  for (auto &Stmt : S)
    for (auto &Acc : Stmt.accesses())
      if (Acc.getScopArrayInfo() == ArrayToTransform) {
        isl_map *AccessRelation = Acc.getAccessRelation();
        AccessRelation =
            isl_map_apply_range(AccessRelation, isl_map_copy(Transformation));
        Acc.setAccessRelation(AccessRelation);
      }
  isl_map_free(Transformation);
}

Yes, I think that in case of the general matrix multiplication it should be fine.

Ref.: [1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf

gareevroman edited edge metadata.

A new version of the patch updated according to the comments.

As far as I understand, it is more intuitive to remap all accesses to an array onto a new array using, for example, the following function:

void remapAccesses(ScopArrayInfo *ArrayToTransform,
                   __isl_take isl_map *Transformation) {
  for (auto &Stmt : S)
    for (auto &Acc : Stmt.accesses())
      if (Acc.getScopArrayInfo() == ArrayToTransform) {
        isl_map *AccessRelation = Acc.getAccessRelation();
        AccessRelation =
            isl_map_apply_range(AccessRelation, isl_map_copy(Transformation));
        Acc.setAccessRelation(AccessRelation);
      }
  isl_map_free(Transformation);
}

However, if I'm not mistaken, we should first distinguish the accesses that will be remapped. For example, to perform the packing transformation we should be able to distinguish the following access relations, because packing them requires different transformations:

S[i, j, k]->M1[i, k], S[i, j, k]->M2[k, j]

Consequently, we should iterate over all memory accesses of a statement to find the required relations if we use remapAccesses:

for (auto &Stmt : S)
  for (auto &Acc : Stmt.accesses()) {
    if (isl_map_is_equal(Acc.getAccessRelation(), RequiredAcc))
      remapAccesses(Acc.getScopArrayInfo(), Transformation);
  }

I think that it's better to avoid the redundant iteration over memory accesses and use the following function:

void remapAccesses(__isl_keep isl_map *RequiredAcc,
                   __isl_take isl_map *NewAccess) {
  for (auto &Stmt : S)
    for (auto &Acc : Stmt.accesses()) {
      isl_map *AccessRelation = Acc.getAccessRelation();
      if (isl_map_is_equal(AccessRelation, RequiredAcc))
        Acc.setAccessRelation(isl_map_copy(NewAccess));
      isl_map_free(AccessRelation);
    }
  isl_map_free(NewAccess);
}

We could probably also replace only the accesses of a specified SCoP statement, because the replacement of accesses of all statements can then be implemented using, for example, the following loop:

for (auto &Stmt : S)
  remapAccesses(Stmt, RequiredAcc, NewAccess);

The described function is implemented in replaceMemAccRelations.

gareevroman updated this revision to Diff 67149.Aug 8 2016, 5:58 AM
gareevroman retitled this revision from [GSoC 2016] [Polly] [WIP] Perform replacement of access relations according to the packing transformation to [GSoC 2016] [Polly] [WIP] Perform replacement of access relations and creation of new arrays according to the packing transformation.
gareevroman updated this object.

In this version, the issue related to the wrong access function of B-type matrices is fixed and redundant functions are removed.

Hi Roman,

sorry for the delay. I am happy with the general direction of the patch, but have a couple of comments, some of them suggesting a slightly different structure in the patch.

Best,
Tobias

lib/Transform/ScheduleOptimizer.cpp
647 ↗(On Diff #67149)

As written earlier, I am a little uncomfortable with an interface that uses an old access relation to identify the memory accesses to rewrite and immediately rewrites it.

Possibly not very different, but for me it would be easier to understand if this were split into two steps:

  1. Identify the MemoryAccess that we will rewrite.

MemoryAccess *identifyAccess()

  2. Then just set the access relation for this memory access. This could even be part of optimizeMatMulPattern().

AccessA->setAccessRelation()

To 1) you could add a description that explains how you identify the memory accesses.

658 ↗(On Diff #67149)

I am surprised you create this access relation from scratch. I would expect that you identify the interesting access in the statement and then just get the original access relation with Access::getAccessRelation(). Or do you use this access relation to identify the memory access? If that is the case, I would expect this to be called from identifyAccess().

667 ↗(On Diff #67149)

that is specific TO matrix multiplication

673 ↗(On Diff #67149)

Can you expand the documentation here and explain both the relation from the original space to the
data space _and_ the relation from the scattering space to the data space.

Can you also give more descriptive names to K, I, and J and explain what they are used for (can be brief).

711 ↗(On Diff #67149)

The name setScopInfo() seems misleading. Also, I am surprised that the creation of a new ScopArrayInfo is hidden here. I would expect a function

ScopArrayInfo *createScopArray(
                                const std::vector<unsigned> &DimSizes,
                                const char *Name)

which just creates the array. Later code then takes the ScopArrayInfo and sets the isl_map's tuple when the isl_map is created/updated.

Also, should you not derive the type of the array from the type of the memory accesses instead of always setting it to double?

724 ↗(On Diff #67149)

MicKer sounds very cryptic, can we use MicroParams and MacroParams?

768 ↗(On Diff #67149)

Why does this need to be specific to the BLIS kernel? If you want to bail out, I would suggest to directly do this in optimizeMatMulPattern and make getInductionVariablesSubstitution just about extracting the schedule map.

test/Isl/CodeGen/MemAccess/mat_mul_pattern_data_layout.ll
1 ↗(On Diff #67149)

Instead of testing that we generate the right IR, I wonder if it makes more sense to check this at the polyhedral level and put it into test/ScheduleOptimizer?

Hi Tobias,

thanks for the comments!

lib/Transform/ScheduleOptimizer.cpp
647 ↗(On Diff #67149)

Sorry, I'm afraid I don't fully understand the interface.

Can we pass some parameter to identifyAccess to specify the memory access that we want to get?

As far as I understand, we should be able to distinguish, for example, the following access relations to perform the packing transformation, because packing them requires different transformations:

S[i, j, k]->M1[i, k], S[i, j, k]->M2[k, j]

Every SCoP that corresponds to a matrix multiplication contains both such accesses. Which access relation is returned by "identifyAccess"?

If I'm not mistaken, you wrote that it's more intuitive to remap all accesses to the array to a new array using, for example, the following function:

void remapAccesses(ScopArrayInfo *ArrayToTransform,
                   __isl_take isl_map *Transformation) {
  for (auto &Stmt : S)
    for (auto &Acc : Stmt.accesses())
      if (Acc.getScopArrayInfo() == ArrayToTransform) {
        isl_map *AccessRelation = Acc.getAccessRelation();
        AccessRelation =
            isl_map_apply_range(AccessRelation, isl_map_copy(Transformation));
        Acc.setAccessRelation(AccessRelation);
      }
  isl_map_free(Transformation);
}

I tried to explain why I didn't use this approach in the following comment: https://reviews.llvm.org/D22187#502358

711 ↗(On Diff #67149)

The name setScopInfo() seems misleading.

I'll try to come up with something more descriptive.

Also, I am surprised that the creation of a new ScopArrayInfo is hidden here. I would expect a function

...

which just create the array. Later code then takes the ScopArrayInfo and sets the isl_map_tuple when the isl_map is created/updated.

We could probably use Scop::createScopArrayInfo(Type *ElementType, const std::string &BaseName, const std::vector<unsigned> &Sizes) for this purpose.

I've just tried to factor out repeated code into setScopInfo.

Also, should you not derive the type of the array from the type of the memory accesses instead of always set it to double?

OK. Let's derive the types of the arrays in this patch. I thought that we could add this in the future.

724 ↗(On Diff #67149)

Yes, I think that it would be better.

768 ↗(On Diff #67149)

OK.

test/Isl/CodeGen/MemAccess/mat_mul_pattern_data_layout.ll
1 ↗(On Diff #67149)

We could probably check that we generate the right ScopInfo. What do you think about it?

Hi Roman,

some quick answers.

Best,
Tobias

lib/Transform/ScheduleOptimizer.cpp
647 ↗(On Diff #67149)

Can we pass some parameter to identifyAccess to specify the memory access that we want to get?

Yes. If you want to make it simple and similar to the code you have today, you could create the following functions:

Access *identifyAccessOfShape(ScopStmt *Stmt,
                              __isl_take isl_map *ExpectedAccessRelation) {
  for (auto *MemA : *Stmt) {
    auto *AccessRelation = MemA->getAccessRelation();
    bool Matches = isl_map_is_equal(ExpectedAccessRelation, AccessRelation);
    isl_map_free(AccessRelation);
    if (Matches) {
      isl_map_free(ExpectedAccessRelation);
      return MemA;
    }
  }
  isl_map_free(ExpectedAccessRelation);
  return nullptr;
}

Access *identifyAccessA(ScopStmt *Stmt) {
  auto *Relation = getMatMulPatternOriginalAccessRelation(Stmt, ?, ?);
  return identifyAccessOfShape(Stmt, Relation);
}
Access *identifyAccessB(ScopStmt *Stmt) {
  auto *Relation = getMatMulPatternOriginalAccessRelation(Stmt, ?, ?);
  return identifyAccessOfShape(Stmt, Relation);
}

This is not very different, but I believe it will make the code easier to understand and will also allow you to get the types directly from the returned arrays.

711 ↗(On Diff #67149)

We could derive types later, but my feeling is that if we get the interface right, this should really just be a one-line change.

test/Isl/CodeGen/MemAccess/mat_mul_pattern_data_layout.ll
1 ↗(On Diff #67149)

Yes, good idea.

A new version of the patch updated according to the comments.

gareevroman added inline comments.Aug 13 2016, 4:16 AM
test/Isl/CodeGen/MemAccess/mat_mul_pattern_data_layout.ll
2 ↗(On Diff #67960)

Sorry, I've realized that I can do it only manually. Could you please advise me how to dump a ScopInfo optimized using IslScheduleOptimizer? Could we implement it, if it is not available at the moment?

Hi Roman,

the code structure now looks a lot better. I have a last round of comments on the documentation of your code. Most of them are minor improvements and suggestions to improve the wording, which should be easy to apply. I would like to have a last look, but I assume that after the next iteration this is good to go.

lib/Transform/ScheduleOptimizer.cpp
622 ↗(On Diff #67960)

@brief Identify a memory access through the shape of its memory access relation.

625 ↗(On Diff #67960)

Identify the unique memory access in @p Stmt, that has an access relation equal to @p ExpectedAccessRelation.

630 ↗(On Diff #67960)

"replaced" does not make sense here. We just find the memory access, but do not know how it will be used later. Maybe just:

"The access relation that identifies the memory access to identify,

633 ↗(On Diff #67960)

nullptr in case there is no or more than one such access.

635 ↗(On Diff #67960)

identifyAccessByAccessRelation

640 ↗(On Diff #67960)

Call it 'Access' instead of 'MemA'. 'A' is too short to be easily understandable.

652 ↗(On Diff #67960)

Please also return nullptr in case more than one access matches the access relation. Just returning the "first" match seems rather random.

655 ↗(On Diff #67960)

specific to THE matrix multiplication pattern. <-- end with "." here.

662 ↗(On Diff #67960)

@param Stmt The SCoP statement for which to generate the access relation.

663 ↗(On Diff #67960)

@param I The index of the input dimension that is mapped to the first output dimension.
@param J The index of the input dimension that is mapped to the second output dimension.

676 ↗(On Diff #67960)

Identify the memory access ...

Finish sentence with "."

678 ↗(On Diff #67960)

Identify the memory access ...

681 ↗(On Diff #67960)

contains THE memory accesses

691 ↗(On Diff #67960)

Identify the memory access ...

693 ↗(On Diff #67960)

Identify the memory access that

696 ↗(On Diff #67960)

contains the memory accesses

706 ↗(On Diff #67960)

Finish sentence with "."

732 ↗(On Diff #67960)

The parameter names in the comments don't match the ones in the function definition.

749 ↗(On Diff #67960)

Finish with ".".

753 ↗(On Diff #67960)

array, and

783 ↗(On Diff #67960)

I would suggest identifying AccessA and AccessB first and bailing out if either of the two is null. Only if we have successfully identified both would I proceed with the transformation. This never leaves us in a state where we transformed one, but not the other.

794 ↗(On Diff #67960)

Finish with "."

test/Isl/CodeGen/MemAccess/mat_mul_pattern_data_layout.ll
2 ↗(On Diff #67960)

You can leave this test for now. We can improve this in the next iteration.

Hi Tobias,

thanks for the comments! I've tried to address them in this version of the patch.

grosser accepted this revision.Aug 15 2016, 4:30 AM
grosser edited edge metadata.

LGTM

lib/Transform/ScheduleOptimizer.cpp
627 ↗(On Diff #68008)

THE memory access

630 ↗(On Diff #68008)

Drop 'to identify'.

This revision is now accepted and ready to land.Aug 15 2016, 4:30 AM
This revision was automatically updated to reflect the committed changes.