This is an archive of the discontinued LLVM Phabricator instance.

[GSoC 2016] [Polly] [WIP] Apply all necessary tilings and interchangings to get a macro-kernel
ClosedPublic

Authored by gareevroman on Jun 18 2016, 10:40 AM.

Details

Summary

This is the second patch to apply the BLIS matmul optimization pattern on matmul kernels (http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf). BLIS implements gemm as three nested loops around a macro-kernel, plus two packing routines. The macro-kernel is implemented in terms of two additional loops around a micro-kernel. The micro-kernel is a loop around a rank-1 (i.e., outer product) update. In this change we create the BLIS macro-kernel by applying a combination of tiling and interchanging. In subsequent changes we will implement the packing transformation.

Diff Detail

Repository
rL LLVM

Event Timeline

gareevroman retitled this revision from to [GSoC 2016] [Polly] [WIP] Apply all necessary tilings and interchangings to get a macro-kernel.
gareevroman updated this object.
gareevroman added a subscriber: pollydev.

This is a draft of the second patch to apply the BLIS matmul optimization pattern on matmul kernels. I'm not sure how it is better to implement interchanging of loops using schedule trees. That's why I would be very grateful for your comments, feedback and ideas.

I think that there are two ways to do it. The first way is to split a band node two times and perform sink to get the desired order of dimensions. In this case, the subtree of the band node would probably cause issues, if, for example, it contains mark nodes. As far as I know, they can't be sinked and should be manually reinserted.

The second way is to manually interchange dimensions of a partial schedule, delete the band node and insert a band node, which contains a new partial schedule. I've implemented it in the patch. However, there is a possible issue, because isl_schedule_node_delete can't be applied to a band node with an anchored subtree and isl_schedule_node_insert_partial_schedule can't be be applied to the schedule tree, which has anchored nodes.

Meinersbur edited edge metadata.Jun 20 2016, 5:47 AM

Thanks for the patch and meaningful comments.

include/polly/ScheduleOptimizer.h
124 ↗(On Diff #61168)

vice versa

243–252 ↗(On Diff #61168)

Please don't copy-paste the same text. There is a @see tag to refer to other parts.

lib/Transform/ScheduleOptimizer.cpp
539 ↗(On Diff #61168)

also check for CacheLevelAssociativityDegrees.size() >=2 to avoid out-of-range errors?

553 ↗(On Diff #61168)

You can use

int MacroKernelParams[] = {Mc, Nc, Kc};

here. It is automatically converted to ArrayRef<int>.

560–562 ↗(On Diff #61168)

Style: No single return in an else branch. The following style seems to be established.

  if (c) {
    ...
    return X;
  }
  return Y;
}
577 ↗(On Diff #61168)

Add an empty line befire to make clear that what is coming is another section?

Thanks for the patch and meaningful comments.

Thank you for the comments! I'll try to address them soon.

gareevroman edited edge metadata.

Changed according to the comments.

gareevroman marked 3 inline comments as done.Jun 24 2016, 4:53 AM
gareevroman added inline comments.
lib/Transform/ScheduleOptimizer.cpp
539 ↗(On Diff #61776)

As far as I understand, it requires information about the first two levels of cache to determine all the parameters (Kc, Mc, Nc) of a macro-kernel. I've added a corresponding comment in the new version of the patch.

560–562 ↗(On Diff #61776)

Thanks. In the new version I've tried to also take into account "Use Early Exits and continue to Simplify Code"

578 ↗(On Diff #61776)

Yes, it's the reason. Probably, it's not good. I wasn't sure wether we should create a function to create a micro-kernel that contains only "std::vector<int> MicroKernelParams{Mr, Nr};" and "Node = applyRegisterTiling(Node, MicroKernelParams, 1);"

grosser edited edge metadata.Jul 11 2016, 6:47 AM

Hi Roman,

thanks for this patch and Michael thank you for your review! I added a couple of comments.

Best,
Tobias

include/polly/ScheduleOptimizer.h
111 ↗(On Diff #61776)

Not part of this commit, but please add in a separate commit (no review needed) full title/author information to this paper, in case this link becomes dead at some point.

119 ↗(On Diff #61776)

"interchanging The"

This seems incomplete.

121 ↗(On Diff #61776)

This paper is already referenced earlier. No need to reference it again.

I personally would suggest to use @see to reference your implementation of the BLIS macro kernel, which than contains the macro-kernel specific information -- in sufficient detail to be self-contained without the BLIS paper.

122 ↗(On Diff #61776)

Drop "it should be noticed". It does not add any additional information.

123 ↗(On Diff #61776)

Drop "should". It does not add any specific information.

129 ↗(On Diff #61776)

Make this a "TODO:"

129 ↗(On Diff #61776)

Make this a TODO:

247 ↗(On Diff #61776)

Can you document what "Mr" describes?

Also, the documentation of this function requires to have read the BLIS paper to understand what this parameter means. That seems not ideal. I suggest to expand this documentation to be (mostly) self contained.

(It seems you may have had more documentation here before. I suggest to move the documentation you have from optimizeMatMulPattern and add the @see in the other function).

lib/Transform/ScheduleOptimizer.cpp
138 ↗(On Diff #61776)

What about" -polly-target-cache-level-associativity - The associativity of each cache level"

and CacheLevelAssociativity

144 ↗(On Diff #61776)

What about "-polly-target-cache-level-sizes - The size of each cache level"?

and CacheLevelSizes

531 ↗(On Diff #61776)

Nice!

540 ↗(On Diff #61776)

than two.

554 ↗(On Diff #61776)

In the subsequent patch you pull out the parameter calculation in a helper function. I suggest to do this already in this patch as otherwise code is just unnecessarily moved around.

559 ↗(On Diff #61776)

Instead of just storing these parameters in a list of integers, I suggest to introduce a struct that allows us to encapsulate these parameters and access them by name.

struct MacroKernelParamsTy {                                                     
  int Mc;                                                                        
  int Nc;                                                                        
  int Kc;                                                                        
};                                                                               
                                                                                 
void foo() {                                                                     
  int Mc, Nc, Kc;                                                                
  struct MacroKernelParamsTy MacroKernelParams = {Mc, Nc, Kc};                   
}
560–562 ↗(On Diff #61776)

Nice!

578 ↗(On Diff #61776)

I am in favor of a new function. It might be mostly empty, but it helps to the code.

gareevroman edited edge metadata.

Hi Tobias,

Thank you for the comments! I've tried to address them. Should we pull out creation of the BLIS micro-kernel and its parameters in helper functions in a separate patch?

gareevroman marked 3 inline comments as not done.Jul 13 2016, 12:24 PM
gareevroman added inline comments.
include/polly/ScheduleOptimizer.h
128–129 ↗(On Diff #63844)

I'll try to do it soon.

The test case was missed.

grosser accepted this revision.Jul 15 2016, 6:14 AM
grosser edited edge metadata.

Hi Roman,

this patch looks very good to me now. Please commit changes that just improve the code structure or documentation of existing code separately.

We can probably still improve the documentation of the cost functions further, but for now this looks sufficient. After all the infrastructure is in place, I expect that we perform further experiments and performance tuning. The results of these efforts will -- at the end of your summer of code -- certainly help us to extend the documentation.

include/polly/ScheduleOptimizer.h
48 ↗(On Diff #63846)

Nice!

128–156 ↗(On Diff #63846)

Very nice documentation!

144 ↗(On Diff #63846)

"account in creation" -> "account during the creation of the BLIS of the BLIS kernels and the computation of their parameters."

155 ↗(On Diff #63846)

The node is required to successfully pass ScheduleTreeOptimizer::isMatrMultPattern

263 ↗(On Diff #63846)

The values of ...

277 ↗(On Diff #63846)

The values passed in MicroKernelParam are used ...

lib/Transform/ScheduleOptimizer.cpp
538 ↗(On Diff #63846)

Nice. This change and some of the changes to improve documentation can be committed separately.

548 ↗(On Diff #63846)

Its 2nd for "second" or 1st for "first, "1nd" does not make sense.

557 ↗(On Diff #63846)

We choose the Mr and Nr parameters of the micro kernel to be large enough such that no stalls

559 ↗(On Diff #63846)

the updates _of_ the resulting

561 ↗(On Diff #63846)

be as small as possible

583 ↗(On Diff #63846)

During the computation of the matrix multiplication,

586 ↗(On Diff #63846)

"It" sounds not right. What is it referring to?

This revision is now accepted and ready to land.Jul 15 2016, 6:14 AM
Meinersbur added inline comments.Jul 15 2016, 9:22 AM
lib/Transform/ScheduleOptimizer.cpp
535 ↗(On Diff #63846)

MicroKernelParamsTy,MacroKernelParamsTy are small enough to be passed by value.

gareevroman added inline comments.Jul 16 2016, 12:28 AM
lib/Transform/ScheduleOptimizer.cpp
535 ↗(On Diff #63846)

MicroKernelParamsTy,MacroKernelParamsTy are small enough to be passed by value.

Hi Michael,

shouldn't we follow advices from "Effective C++", according to LLVM Coding Standards?

As far as I understand, in item 20 it is stated that the only types for which we can reasonably assume that pass-by-value is inexpensive are build-in types and STL iterator and function object types.

Meinersbur added inline comments.Jul 19 2016, 9:50 AM
lib/Transform/ScheduleOptimizer.cpp
535 ↗(On Diff #63846)

LLVM has its own types that are neither STL or built-ins that are meant to be passes by value: ArrayRef, StringRef, PointerIntPair, CallSite, etc.

"Effective C++" in the LLVM Coding Standards is referenced as the source of some recommendations, not as implicit import of rules.

We can reasonable assume that passing these types by value is inexpensive as we declared them ourselves as POD with few members in its header.

Since C++11 it can be better to pass even heavy objects by value if the function is going to consume/take ownership using std::move().

However, please feel free to not change it if you prefer this style.

mreisinger added inline comments.
test/ScheduleOptimizer/pattern-matching-based-opts_3.ll
66 ↗(On Diff #63846)

"1nd" - see Tobias' note from above

70 ↗(On Diff #63846)

"1nd" - see Tobias' note from above

gareevroman added inline comments.Jul 20 2016, 2:56 AM
lib/Transform/ScheduleOptimizer.cpp
535 ↗(On Diff #63846)

Thanks for the explanation.

LLVM has its own types that are neither STL or built-ins that are meant to be passes by value: ArrayRef, StringRef, PointerIntPair, CallSite, etc.
"Effective C++" in the LLVM Coding Standards is referenced as the source of some recommendations, not as implicit import of rules.

As far as I understand, according to the description, LLVM Coding Standards should not be regarded as an absolute requirement to be followed in all instances and "Effective C++" is one of the two books, which are proposed as particularly important books for work. I just tried to understand how much they are important for us.

We can reasonable assume that passing these types by value is inexpensive as we declared them ourselves as POD with few members in its header.
Since C++11 it can be better to pass even heavy objects by value if the function is going to consume/take ownership using std::move().

Yes, I agree with it. My understanding is that in this case passing by reference can be a precaution, because implementation of a user-defined type can be changed in future.

However, please feel free to not change it if you prefer this style.

I prefer to follow LLVM Coding Standards. OK, I'll fix it so they are passed by value.

This revision was automatically updated to reflect the committed changes.

@grosser, @Meinersbur, @mreisinger, thank you for the comments! I've tried to address them in r276616 and r276627.