This is an archive of the discontinued LLVM Phabricator instance.

[GSoC 2016] [Polly] [WIP] Apply all necessary tilings and interchangings to get a macro-kernel
ClosedPublic

Authored by gareevroman on Jun 18 2016, 10:40 AM.

Details

Summary

This is the second patch to apply the BLIS matmul optimization pattern on matmul kernels (http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf). BLIS implements gemm as three nested loops around a macro-kernel, plus two packing routines. The macro-kernel is implemented in terms of two additional loops around a micro-kernel. The micro-kernel is a loop around a rank-1 (i.e., outer product) update. In this change we create the BLIS macro-kernel by applying a combination of tiling and interchanging. In subsequent changes we will implement the packing transformation.

Diff Detail

Event Timeline

gareevroman retitled this revision from to [GSoC 2016] [Polly] [WIP] Apply all necessary tilings and interchangings to get a macro-kernel.
gareevroman updated this object.
gareevroman added a subscriber: pollydev.

This is a draft of the second patch to apply the BLIS matmul optimization pattern on matmul kernels. I'm not sure how it is better to implement interchanging of loops using schedule trees. That's why I would be very grateful for your comments, feedback and ideas.

I think that there are two ways to do it. The first way is to split a band node two times and perform sink to get the desired order of dimensions. In this case, the subtree of the band node would probably cause issues, if, for example, it contains mark nodes. As far as I know, they can't be sinked and should be manually reinserted.

The second way is to manually interchange dimensions of a partial schedule, delete the band node and insert a band node, which contains a new partial schedule. I've implemented it in the patch. However, there is a possible issue, because isl_schedule_node_delete can't be applied to a band node with an anchored subtree and isl_schedule_node_insert_partial_schedule can't be be applied to the schedule tree, which has anchored nodes.

Meinersbur edited edge metadata.Jun 20 2016, 5:47 AM

Thanks for the patch and meaningful comments.

include/polly/ScheduleOptimizer.h
134

vice versa

261–270

Please don't copy-paste the same text. There is a @see tag to refer to other parts.

lib/Transform/ScheduleOptimizer.cpp
539

also check for CacheLevelAssociativityDegrees.size() >=2 to avoid out-of-range errors?

553

You can use

int MacroKernelParams[] = {Mc, Nc, Kc};

here. It is automatically converted to ArrayRef<int>.

560–562

Style: No single return in an else branch. The following style seems to be established.

  if (c) {
    ...
    return X;
  }
  return Y;
}
579

Add an empty line befire to make clear that what is coming is another section?

Thanks for the patch and meaningful comments.

Thank you for the comments! I'll try to address them soon.

gareevroman edited edge metadata.

Changed according to the comments.

gareevroman marked 3 inline comments as done.Jun 24 2016, 4:53 AM
gareevroman added inline comments.
lib/Transform/ScheduleOptimizer.cpp
539

As far as I understand, it requires information about the first two levels of cache to determine all the parameters (Kc, Mc, Nc) of a macro-kernel. I've added a corresponding comment in the new version of the patch.

560–562

Thanks. In the new version I've tried to also take into account "Use Early Exits and continue to Simplify Code"

578

Yes, it's the reason. Probably, it's not good. I wasn't sure wether we should create a function to create a micro-kernel that contains only "std::vector<int> MicroKernelParams{Mr, Nr};" and "Node = applyRegisterTiling(Node, MicroKernelParams, 1);"

grosser edited edge metadata.Jul 11 2016, 6:47 AM

Hi Roman,

thanks for this patch and Michael thank you for your review! I added a couple of comments.

Best,
Tobias

include/polly/ScheduleOptimizer.h
128–129

Not part of this commit, but please add in a separate commit (no review needed) full title/author information to this paper, in case this link becomes dead at some point.

129

"interchanging The"

This seems incomplete.

131

This paper is already referenced earlier. No need to reference it again.

I personally would suggest to use @see to reference your implementation of the BLIS macro kernel, which than contains the macro-kernel specific information -- in sufficient detail to be self-contained without the BLIS paper.

132

Drop "it should be noticed". It does not add any additional information.

133

Drop "should". It does not add any specific information.

139

Make this a "TODO:"

139

Make this a TODO:

265

Can you document what "Mr" describes?

Also, the documentation of this function requires to have read the BLIS paper to understand what this parameter means. That seems not ideal. I suggest to expand this documentation to be (mostly) self contained.

(It seems you may have had more documentation here before. I suggest to move the documentation you have from optimizeMatMulPattern and add the @see in the other function).

lib/Transform/ScheduleOptimizer.cpp
138

What about" -polly-target-cache-level-associativity - The associativity of each cache level"

and CacheLevelAssociativity

144

What about "-polly-target-cache-level-sizes - The size of each cache level"?

and CacheLevelSizes

531

Nice!

540

than two.

554

In the subsequent patch you pull out the parameter calculation in a helper function. I suggest to do this already in this patch as otherwise code is just unnecessarily moved around.

559

Instead of just storing these parameters in a list of integers, I suggest to introduce a struct that allows us to encapsulate these parameters and access them by name.

struct MacroKernelParamsTy {                                                     
  int Mc;                                                                        
  int Nc;                                                                        
  int Kc;                                                                        
};                                                                               
                                                                                 
void foo() {                                                                     
  int Mc, Nc, Kc;                                                                
  struct MacroKernelParamsTy MacroKernelParams = {Mc, Nc, Kc};                   
}
560–562

Nice!

578

I am in favor of a new function. It might be mostly empty, but it helps to the code.

gareevroman edited edge metadata.

Hi Tobias,

Thank you for the comments! I've tried to address them. Should we pull out creation of the BLIS micro-kernel and its parameters in helper functions in a separate patch?

gareevroman marked 3 inline comments as not done.Jul 13 2016, 12:24 PM
gareevroman added inline comments.
include/polly/ScheduleOptimizer.h
128–156

I'll try to do it soon.

The test case was missed.

grosser accepted this revision.Jul 15 2016, 6:14 AM
grosser edited edge metadata.

Hi Roman,

this patch looks very good to me now. Please commit changes that just improve the code structure or documentation of existing code separately.

We can probably still improve the documentation of the cost functions further, but for now this looks sufficient. After all the infrastructure is in place, I expect that we perform further experiments and performance tuning. The results of these efforts will -- at the end of your summer of code -- certainly help us to extend the documentation.

include/polly/ScheduleOptimizer.h
48

Nice!

128–156

Very nice documentation!

144

"account in creation" -> "account during the creation of the BLIS of the BLIS kernels and the computation of their parameters."

155

The node is required to successfully pass ScheduleTreeOptimizer::isMatrMultPattern

263

The values of ...

277

The values passed in MicroKernelParam are used ...

lib/Transform/ScheduleOptimizer.cpp
538

Nice. This change and some of the changes to improve documentation can be committed separately.

548

Its 2nd for "second" or 1st for "first, "1nd" does not make sense.

557

We choose the Mr and Nr parameters of the micro kernel to be large enough such that no stalls

559

the updates _of_ the resulting

561

be as small as possible

583

During the computation of the matrix multiplication,

586

"It" sounds not right. What is it referring to?

This revision is now accepted and ready to land.Jul 15 2016, 6:14 AM
Meinersbur added inline comments.Jul 15 2016, 9:22 AM
lib/Transform/ScheduleOptimizer.cpp
535

MicroKernelParamsTy,MacroKernelParamsTy are small enough to be passed by value.

gareevroman added inline comments.Jul 16 2016, 12:28 AM
lib/Transform/ScheduleOptimizer.cpp
535

MicroKernelParamsTy,MacroKernelParamsTy are small enough to be passed by value.

Hi Michael,

shouldn't we follow advices from "Effective C++", according to LLVM Coding Standards?

As far as I understand, in item 20 it is stated that the only types for which we can reasonably assume that pass-by-value is inexpensive are build-in types and STL iterator and function object types.

Meinersbur added inline comments.Jul 19 2016, 9:50 AM
lib/Transform/ScheduleOptimizer.cpp
535

LLVM has its own types that are neither STL or built-ins that are meant to be passes by value: ArrayRef, StringRef, PointerIntPair, CallSite, etc.

"Effective C++" in the LLVM Coding Standards is referenced as the source of some recommendations, not as implicit import of rules.

We can reasonable assume that passing these types by value is inexpensive as we declared them ourselves as POD with few members in its header.

Since C++11 it can be better to pass even heavy objects by value if the function is going to consume/take ownership using std::move().

However, please feel free to not change it if you prefer this style.

mreisinger added inline comments.
test/ScheduleOptimizer/pattern-matching-based-opts_3.ll
66

"1nd" - see Tobias' note from above

70

"1nd" - see Tobias' note from above

gareevroman added inline comments.Jul 20 2016, 2:56 AM
lib/Transform/ScheduleOptimizer.cpp
535

Thanks for the explanation.

LLVM has its own types that are neither STL or built-ins that are meant to be passes by value: ArrayRef, StringRef, PointerIntPair, CallSite, etc.
"Effective C++" in the LLVM Coding Standards is referenced as the source of some recommendations, not as implicit import of rules.

As far as I understand, according to the description, LLVM Coding Standards should not be regarded as an absolute requirement to be followed in all instances and "Effective C++" is one of the two books, which are proposed as particularly important books for work. I just tried to understand how much they are important for us.

We can reasonable assume that passing these types by value is inexpensive as we declared them ourselves as POD with few members in its header.
Since C++11 it can be better to pass even heavy objects by value if the function is going to consume/take ownership using std::move().

Yes, I agree with it. My understanding is that in this case passing by reference can be a precaution, because implementation of a user-defined type can be changed in future.

However, please feel free to not change it if you prefer this style.

I prefer to follow LLVM Coding Standards. OK, I'll fix it so they are passed by value.

This revision was automatically updated to reflect the committed changes.

@grosser, @Meinersbur, @mreisinger, thank you for the comments! I've tried to address them in r276616 and r276627.