This is the second patch to apply the BLIS matmul optimization pattern on matmul kernels (http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf). BLIS implements gemm as three nested loops around a macro-kernel, plus two packing routines. The macro-kernel is implemented in terms of two additional loops around a micro-kernel. The micro-kernel is a loop around a rank-1 (i.e., outer product) update. In this change we create the BLIS macro-kernel by applying a combination of tiling and interchanging. In subsequent changes we will implement the packing transformation.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
This is a draft of the second patch to apply the BLIS matmul optimization pattern on matmul kernels. I'm not sure how it is better to implement interchanging of loops using schedule trees. That's why I would be very grateful for your comments, feedback and ideas.
I think that there are two ways to do it. The first way is to split a band node two times and perform sink to get the desired order of dimensions. In this case, the subtree of the band node would probably cause issues, if, for example, it contains mark nodes. As far as I know, they can't be sinked and should be manually reinserted.
The second way is to manually interchange dimensions of a partial schedule, delete the band node and insert a band node, which contains a new partial schedule. I've implemented it in the patch. However, there is a possible issue, because isl_schedule_node_delete can't be applied to a band node with an anchored subtree and isl_schedule_node_insert_partial_schedule can't be be applied to the schedule tree, which has anchored nodes.
Thanks for the patch and meaningful comments.
include/polly/ScheduleOptimizer.h | ||
---|---|---|
124 ↗ | (On Diff #61168) | vice versa |
243–252 ↗ | (On Diff #61168) | Please don't copy-paste the same text. There is a @see tag to refer to other parts. |
lib/Transform/ScheduleOptimizer.cpp | ||
539 ↗ | (On Diff #61168) | also check for CacheLevelAssociativityDegrees.size() >=2 to avoid out-of-range errors? |
553 ↗ | (On Diff #61168) | You can use int MacroKernelParams[] = {Mc, Nc, Kc}; here. It is automatically converted to ArrayRef<int>. |
560–562 ↗ | (On Diff #61168) | Style: No single return in an else branch. The following style seems to be established. if (c) { ... return X; } return Y; } |
577 ↗ | (On Diff #61168) | Add an empty line befire to make clear that what is coming is another section? |
lib/Transform/ScheduleOptimizer.cpp | ||
---|---|---|
539 ↗ | (On Diff #61776) | As far as I understand, it requires information about the first two levels of cache to determine all the parameters (Kc, Mc, Nc) of a macro-kernel. I've added a corresponding comment in the new version of the patch. |
560–562 ↗ | (On Diff #61776) | Thanks. In the new version I've tried to also take into account "Use Early Exits and continue to Simplify Code" |
578 ↗ | (On Diff #61776) | Yes, it's the reason. Probably, it's not good. I wasn't sure wether we should create a function to create a micro-kernel that contains only "std::vector<int> MicroKernelParams{Mr, Nr};" and "Node = applyRegisterTiling(Node, MicroKernelParams, 1);" |
Hi Roman,
thanks for this patch and Michael thank you for your review! I added a couple of comments.
Best,
Tobias
include/polly/ScheduleOptimizer.h | ||
---|---|---|
111 ↗ | (On Diff #61776) | Not part of this commit, but please add in a separate commit (no review needed) full title/author information to this paper, in case this link becomes dead at some point. |
119 ↗ | (On Diff #61776) | "interchanging The" This seems incomplete. |
121 ↗ | (On Diff #61776) | This paper is already referenced earlier. No need to reference it again. I personally would suggest to use @see to reference your implementation of the BLIS macro kernel, which than contains the macro-kernel specific information -- in sufficient detail to be self-contained without the BLIS paper. |
122 ↗ | (On Diff #61776) | Drop "it should be noticed". It does not add any additional information. |
123 ↗ | (On Diff #61776) | Drop "should". It does not add any specific information. |
129 ↗ | (On Diff #61776) | Make this a "TODO:" |
129 ↗ | (On Diff #61776) | Make this a TODO: |
247 ↗ | (On Diff #61776) | Can you document what "Mr" describes? Also, the documentation of this function requires to have read the BLIS paper to understand what this parameter means. That seems not ideal. I suggest to expand this documentation to be (mostly) self contained. (It seems you may have had more documentation here before. I suggest to move the documentation you have from optimizeMatMulPattern and add the @see in the other function). |
lib/Transform/ScheduleOptimizer.cpp | ||
138 ↗ | (On Diff #61776) | What about" -polly-target-cache-level-associativity - The associativity of each cache level" and CacheLevelAssociativity |
144 ↗ | (On Diff #61776) | What about "-polly-target-cache-level-sizes - The size of each cache level"? and CacheLevelSizes |
531 ↗ | (On Diff #61776) | Nice! |
540 ↗ | (On Diff #61776) | than two. |
554 ↗ | (On Diff #61776) | In the subsequent patch you pull out the parameter calculation in a helper function. I suggest to do this already in this patch as otherwise code is just unnecessarily moved around. |
559 ↗ | (On Diff #61776) | Instead of just storing these parameters in a list of integers, I suggest to introduce a struct that allows us to encapsulate these parameters and access them by name. struct MacroKernelParamsTy { int Mc; int Nc; int Kc; }; void foo() { int Mc, Nc, Kc; struct MacroKernelParamsTy MacroKernelParams = {Mc, Nc, Kc}; } |
560–562 ↗ | (On Diff #61776) | Nice! |
578 ↗ | (On Diff #61776) | I am in favor of a new function. It might be mostly empty, but it helps to the code. |
Hi Tobias,
Thank you for the comments! I've tried to address them. Should we pull out creation of the BLIS micro-kernel and its parameters in helper functions in a separate patch?
include/polly/ScheduleOptimizer.h | ||
---|---|---|
128–129 ↗ | (On Diff #63844) | I'll try to do it soon. |
Hi Roman,
this patch looks very good to me now. Please commit changes that just improve the code structure or documentation of existing code separately.
We can probably still improve the documentation of the cost functions further, but for now this looks sufficient. After all the infrastructure is in place, I expect that we perform further experiments and performance tuning. The results of these efforts will -- at the end of your summer of code -- certainly help us to extend the documentation.
include/polly/ScheduleOptimizer.h | ||
---|---|---|
48 ↗ | (On Diff #63846) | Nice! |
128–156 ↗ | (On Diff #63846) | Very nice documentation! |
144 ↗ | (On Diff #63846) | "account in creation" -> "account during the creation of the BLIS of the BLIS kernels and the computation of their parameters." |
155 ↗ | (On Diff #63846) | The node is required to successfully pass ScheduleTreeOptimizer::isMatrMultPattern |
263 ↗ | (On Diff #63846) | The values of ... |
277 ↗ | (On Diff #63846) | The values passed in MicroKernelParam are used ... |
lib/Transform/ScheduleOptimizer.cpp | ||
538 ↗ | (On Diff #63846) | Nice. This change and some of the changes to improve documentation can be committed separately. |
548 ↗ | (On Diff #63846) | Its 2nd for "second" or 1st for "first, "1nd" does not make sense. |
557 ↗ | (On Diff #63846) | We choose the Mr and Nr parameters of the micro kernel to be large enough such that no stalls |
559 ↗ | (On Diff #63846) | the updates _of_ the resulting |
561 ↗ | (On Diff #63846) | be as small as possible |
583 ↗ | (On Diff #63846) | During the computation of the matrix multiplication, |
586 ↗ | (On Diff #63846) | "It" sounds not right. What is it referring to? |
lib/Transform/ScheduleOptimizer.cpp | ||
---|---|---|
535 ↗ | (On Diff #63846) | MicroKernelParamsTy,MacroKernelParamsTy are small enough to be passed by value. |
lib/Transform/ScheduleOptimizer.cpp | ||
---|---|---|
535 ↗ | (On Diff #63846) |
Hi Michael, shouldn't we follow advices from "Effective C++", according to LLVM Coding Standards? As far as I understand, in item 20 it is stated that the only types for which we can reasonably assume that pass-by-value is inexpensive are build-in types and STL iterator and function object types. |
lib/Transform/ScheduleOptimizer.cpp | ||
---|---|---|
535 ↗ | (On Diff #63846) | LLVM has its own types that are neither STL or built-ins that are meant to be passes by value: ArrayRef, StringRef, PointerIntPair, CallSite, etc. "Effective C++" in the LLVM Coding Standards is referenced as the source of some recommendations, not as implicit import of rules. We can reasonable assume that passing these types by value is inexpensive as we declared them ourselves as POD with few members in its header. Since C++11 it can be better to pass even heavy objects by value if the function is going to consume/take ownership using std::move(). However, please feel free to not change it if you prefer this style. |
lib/Transform/ScheduleOptimizer.cpp | ||
---|---|---|
535 ↗ | (On Diff #63846) | Thanks for the explanation.
As far as I understand, according to the description, LLVM Coding Standards should not be regarded as an absolute requirement to be followed in all instances and "Effective C++" is one of the two books, which are proposed as particularly important books for work. I just tried to understand how much they are important for us.
Yes, I agree with it. My understanding is that in this case passing by reference can be a precaution, because implementation of a user-defined type can be changed in future.
I prefer to follow LLVM Coding Standards. OK, I'll fix it so they are passed by value. |
@grosser, @Meinersbur, @mreisinger, thank you for the comments! I've tried to address them in r276616 and r276627.