This is the fourth patch to apply the BLIS matmul optimization pattern on matmul kernels (http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf). BLIS implements gemm as three nested loops around a macro-kernel, plus two packing routines. The macro-kernel is implemented in terms of two additional loops around a micro-kernel. The micro-kernel is a loop around a rank-1 (i.e., outer product) update. In this change we perform copying to created arrays, which is the last step to implement the packing transformation.
Details
- Reviewers
Meinersbur grosser jdoerfert - Commits
- rGb3224adfb61a: Perform copying to created arrays according to the packing transformation
rPLO281441: Perform copying to created arrays according to the packing transformation
rL281441: Perform copying to created arrays according to the packing transformation
Diff Detail
- Repository
- rL LLVM
Event Timeline
In this draft of the patch copy statements are represented as Scop statements whose values of BB and R fields are equal to nullptr. These statements have only two memory accesses. The first one represents reading from an array that should be packed. The second one represents writing to a created array. What do you think about it?
The current implementation of the idea have several drawbacks. I would be grateful for your comments and ideas about them.
One of the drawbacks is manual deletion of copy statements in IslNodeBuilder::generateCopyStmt. We could, for example, store mapping between packed memory accesses and copy statements, to be able to delete all created copy statements in a destructor of the Stmt class.
Another drawback is a location where the corresponding IR is generated. Should we generate it in BlockGenerators? Probably it requires addition of a new class (e.g. CopyGenerator). At the moment the corresponding IR is generated in IslNodeBuilder::generateCopyStmt.
I have one more question related to generation of the corresponding IR. As far as I understand, to generate it, we should create one load and one store intrusion. In the case of the load instruction, we could use IslExprBuilder::create. However, to generate the store instruction, we should probably be able to call IslExprBuilder::createAccessAddress that is a private member of the IslExprBuilder class. Maybe we should make it a public member (this approach is used in the patch).
The last question is related to utilization of extension nodes. I had to comment out the calls of ScheduleTreeOptimizer::isProfitableSchedule and isl_schedule_get_map, because otherwise I get the following error:
"/home/roman/Documents/polly/llvm/tools/polly/lib/External/isl/isl_schedule_tree.c:1820: cannot construct subtree schedule of tree with extension nodes"
How to apply isl_schedule_get_map in case a schedule tree has extension nodes?
include/polly/CodeGen/IslExprBuilder.h | ||
---|---|---|
169 ↗ | (On Diff #67148) | When becoming part of the public interface, please add doucmentation. |
include/polly/CodeGen/IslNodeBuilder.h | ||
353–354 ↗ | (On Diff #67148) | It is not clear from where to where something is copied. I'd expect parameters for specifying the source and target locations, the subrange to be copied and a mapping between the index in the source to the index of target. |
lib/Analysis/ScopInfo.cpp | ||
1443–1477 ↗ | (On Diff #67148) | Please don't do significant work in the constructor. Constructors are for initializing members. Such code make reuse of that functionality from other parts more difficult and calling virtual member functions may not work as expected. |
1468 ↗ | (On Diff #67148) | Both memory accesses seem to read and write from the same array at the same index. This is not intended, is it? |
lib/CodeGen/IslNodeBuilder.cpp | ||
791 ↗ | (On Diff #67148) | Why does this delete Stmt? |
810 ↗ | (On Diff #67148) | Would it be possible to handle the copy statements as special case of the BlockGenerator, just with BB set to nullptr? |
Hi Michael,
thanks for the comments!
include/polly/CodeGen/IslExprBuilder.h | ||
---|---|---|
169 ↗ | (On Diff #67148) | OK. However, I'm not sure whether it's fine to make it a public member. |
include/polly/CodeGen/IslNodeBuilder.h | ||
353–354 ↗ | (On Diff #67148) | I think that we could use createCopyStmt for this purpose. It accepts an extension map of an extension node that helps to specify the subrange to be copied and a mapping between the index in the source to the index of target. Another parameter is a memory access whose original and new access relations specify the source and target locations, respectively. What do you think about it? IslNodeBuilder::generateCopyStmt is used to generate IR that corresponds to the body of a copy statement. If we handle the copy statements as special case of the BlockGenerator, we could probably remove the function. |
lib/Analysis/ScopInfo.cpp | ||
1443–1477 ↗ | (On Diff #67148) | OK. |
1468 ↗ | (On Diff #67148) | As far as I understand, they read and write from different arrays, if ScopArrayInfo objects accessed via the original and new access relations represent different BasePtrs. I'll add assert that checks it. |
lib/CodeGen/IslNodeBuilder.cpp | ||
791 ↗ | (On Diff #67148) | As far as I understand, to avoid this manual deletion of copy statements, they should be stored in the Stmts, which is a std::list<ScopStmt> that contains all statements of the SCoP and filled using Scop::addScopStmt. The destructor of the SCoP clears the list and hence deletes the statements. However, Scop::addScopStmt only creates a new SCoP statement for either a basic block or a region. Should we rewrite Scop::addScopStmt to be able to create a copy statement for a memory access? If we decide to do it, we should probably also have to make it public, because we'll call it in ScheduleTreeOptimizer. |
810 ↗ | (On Diff #67148) | I think that it's possible. Should we handle this case in BlockGenerator::copyStmt? |
include/polly/CodeGen/IslExprBuilder.h | ||
---|---|---|
169 ↗ | (On Diff #67148) | OK with me (we don't really have a stable public interface at this level). |
lib/Analysis/ScopInfo.cpp | ||
1468 ↗ | (On Diff #67148) | OK, I think I understand better now. It first create a no-op that reads and writes from/to the same location (MemA->getBaseAddr()/Subscripts) and only later the accesses are changed via setNewAccessRelation(). One two MemA->getOriginalAccessRelation(), the other to MemA->getAccessRelation(). I find this confusing, they are even using the same local variable NewAccessRelation for different accesses. I'd prefer to pass the source/target relation explicitly than taking from MemA. There is also no need to first initialize the MemoryAccesses using with an arbitrary access relation that is never used. You can add a constructor to initialize the AccessRelation with the right value from the beginning. AFAIU memory accesses should be added using emplace_back on the access list returned by scop->getOrCreateAccessFunctions(). Since we dont have a BB, passing nullptr should be OK as well. Otherwise we are leaking memory. |
lib/CodeGen/IslNodeBuilder.cpp | ||
791 ↗ | (On Diff #67148) | OK, you might want to change addScopStmt to something like ScopStmt* Scop::addScopStmt(BasicBlock *BB, Region *R) { if (BB) { Stmts.emplace_back(*this, *BB); auto *Stmt = &Stmts.back(); StmtMap[BB] = Stmt; return Stmt; } else if (R) { assert(R && "Either basic block or a region expected."); Stmts.emplace_back(*this, *R); auto *Stmt = &Stmts.back(); for (BasicBlock *BB : R->blocks()) StmtMap[BB] = Stmt; return Stmt; } Stmts.emplace_back(*this, (BasicBlock*)nullptr); return Stmts.back(); } |
810 ↗ | (On Diff #67148) | Yes, I think it would be preferable to be able to share code for generating index expressions (generateLocationAccessed) |
lib/Transform/ScheduleOptimizer.cpp | ||
651 ↗ | (On Diff #67148) | @p DimNum (Renaming to just Dim could be useful to avoid confusion with "Number of Dimensions") |
681 ↗ | (On Diff #67148) | @brief |
690 ↗ | (On Diff #67148) | @p SecondOutputDimBound |
Hi Michael,
thank you for the comments! I've tried to address them in a new version of the patch.
lib/Analysis/ScopInfo.cpp | ||
---|---|---|
1465–1499 ↗ | (On Diff #68212) | I've modified constructors of ScopStmt and MemoryAccess in a new version of the patch. If you think that significant work is done in them, I'll try to factor out the code. |
1490 ↗ | (On Diff #68212) |
I've tried to implement it. However, at the moment copy statements require domain of a corresponding statement that contains a memory access, which is copied. I'll try understand how to avoid it.
Should we make getOrCreateAccessFunctions public in this case? |
lib/CodeGen/IslNodeBuilder.cpp | ||
783 ↗ | (On Diff #68212) | I'm afraid that it's not possible to implement such a change, because there is no a constructor for a ScopStmt, that accepts pointers. Maybe I'm missing something. I've tried to use overloaded addScopStmt of the following form ScopStmt* Scop::addScopStmt(__isl_take isl_map *SourceRel, __isl_take isl_map *TargetRel, __isl_take isl_set *Domain) { Stmts.emplace_back(*this, SourceRel, TargetRel, Domain); return Stmts.back(); } In case of arrays created by Polly, it causes an error, because ScopAnnotator::buildAliasScopes uses names of BasePtrs in the following line: MDString::get(Ctx, (AliasScopeStr + BasePtr->getName()).str().c_str())); However, BasePtrs aren't known at the moment of calling of the ScopAnnotator::buildAliasScopes, because at this moment they aren't allocated. I haven't come up with a solution yet. |
801 ↗ | (On Diff #68212) | Sorry, I don't understand how to share generateLocationAccessed. If I'm not mistaken, in case of copy statements, basic blocks are empty and we can't call, for example, getPointerOperand. For the same reason I don't understand how generateLocationAccessed can be called in case of copy statements. I thought that we could probably rewrite BlockGenerator::copyStmt and get something like this: void BlockGenerator::copyStmt(ScopStmt &Stmt, LoopToScevMapT <S, isl_id_to_ast_expr *NewAccesses) { if (Stmt.isBlockStmt()) { ValueMapT BBMap; BasicBlock *BB = Stmt.getBasicBlock(); copyBB(Stmt, BB, BBMap, LTS, NewAccesses); removeDeadInstructions(BB, BBMap); } else { // handle a copy statement } } |
Hi Roman,
I went once more over the patch. The overall direction looks good, but I have a couple of changes to suggest. Also, i believe we should separate the extensions to general Polly from your matmul specific changes. Specifically, I suggest to pull out two patches:
- Simplify AccFuncMap to vector<> AccessFunctions
- Introduction of CopyStmts (via jscop)
plus the final gemm patch. I know this adds some additional work, but as I said earlier I prefer to have clean and thought through patches, rather than rushing to push things in. Obviously, for your GSoC evaluation I will take into account that you spent additional time on cleaning and improving the general Polly infrastructure.
include/polly/CodeGen/IslExprBuilder.h | ||
---|---|---|
169 ↗ | (On Diff #68212) | @brief Create LLVM-IR that computes the memory location of an access expression. For a given isl_ast_expr[ession] of type isl_ast_op_access this function creates IR |
include/polly/CodeGen/IslNodeBuilder.h | ||
382 ↗ | (On Diff #68212) | @brief Create code for a copy statement A copy statement is expected to have one read memory access and one write memory access (in this very order). Data is loaded from the location described by the read memory access and written to the location described by the write memory access. @p NewAccesses contains for each access the isl ast expression that describes the location accessed. |
lib/Analysis/ScopInfo.cpp | ||
1469 ↗ | (On Diff #68212) | This will result in both copy statements to have the very same name. This is probably not what we want. Can we add a "_<id>" suffix that distinguishes multiple copy statements? (At best we keep track of the number of copy statements created for a scop and use this as an index. |
1473 ↗ | (On Diff #68212) | Passing 'nullptr' here means that we will not be able to distinguish accesses from different copy statements. Interestingly, it seems we are not even using this facility any more. getAccessFunctions() seems to be dead code and I also don't see the 'BB' argument of getOrCreateAccessFunctions() to be used. The cleanest solution would probably prepare a cleanup patch that deletes getAccessFunctions and that transforms AccFuncMap into a std::vector<> AccessFunctions. This would remove avoid the need to pass a nullptr here, which really seems to be an unnecessary workaround. |
lib/CodeGen/IslNodeBuilder.cpp | ||
783 ↗ | (On Diff #68212) | Maybe avoid the loop and just assume load first and store second. We will always generate the copyStmt like this, so there is no need to add complexity to be more generic. Please also add asserts that check that both accesses exist, that they use the same data type and that both accesses are array accesses for now. |
801 ↗ | (On Diff #68212) | I am fine with the solution you choose for now, but we may want to think later if we can structure this a little nicer. |
lib/Transform/ScheduleOptimizer.cpp | ||
716 ↗ | (On Diff #68212) | where |
1202 ↗ | (On Diff #68212) | Ah, there ist the problem that extension nodes cannot be handled. Commenting this out is probably the worst choice. I would suggest to change isProfitableSchedule to take a ScheduleTree, and as a first condition in isProfitableSchedule check if the schedule contains extension nodes. In this case just assume its profitable (as we apparently did a larger change to make this happen). |
test/Isl/CodeGen/MemAccess/mat_mul_pattern_data_layout.ll | ||
19 ↗ | (On Diff #68212) | Uh, I really dislike more and more the fact that your schedule transformation is tested here. Can we just add a flag to ScheduleOptimizer which dumps the new scop and check for the new scop containing the right statements? As well as the -polly-ast generating the right code? Isl/CodeGen/MemAccess should really just test that we generate correct code for newly added memory accesses added through a jscop input. |
Hi Tobias,
thank you for the comments! I'll try to address them.
test/Isl/CodeGen/MemAccess/mat_mul_pattern_data_layout.ll | ||
---|---|---|
19 ↗ | (On Diff #68212) |
I've tried to implement this in
It also causes the following error error: /home/roman/Documents/polly/llvm/tools/polly/lib/External/isl/isl_schedule_tree.c:1820: cannot construct subtree schedule of tree with extension nodes because S.getSchedule() called from IslAstInfo::printScop contain isl_schedule_get_map. Thank you for the comment. I'll try to understand how to avoid the issue. |
test/Isl/CodeGen/MemAccess/mat_mul_pattern_data_layout.ll | ||
---|---|---|
19 ↗ | (On Diff #68212) |
I've tried to implement this in https://reviews.llvm.org/D23740. |
This is a new version of the patch updated according to the comments.
Introduction of CopyStmts (via jscop)
Sorry, I haven't understood how to do it properly yet.
To import a copy statement, we could possibly specify its extension map and try to find its position in the schedule tree and insert it. To be able to import an exported copy statement, we should probably store its extension map. Maybe it is wrong way.
There is also a problem related to exporting CopyStmts. We use ScopStmt::getScheduleStr() to export a schedule of a statement. It calls ScopStmt::getSchedule that calls Scop::getSchedule, which calls isl_schedule_get_map that causes:
/home/roman/Documents/polly/llvm/tools/polly/lib/External/isl/isl_schedule_tree.c:1820: cannot construct subtree schedule of tree with extension nodes
For the same reason we can't use ScopInfo's to check the work of the IslScheduleOptimizer pass at the polyhedral level in this case (that's why I temporary use the previous variant of the test)
We could make ScopStmt::getScheduleStr() return an empty string in case a schedule tree has extension nodes. It doesn't solve a problem related to export of CopyStmts. However, it could probably help to build a test case.
lib/Transform/ScheduleOptimizer.cpp | ||
---|---|---|
1211 ↗ | (On Diff #68816) | Could you please advise me how to check if the schedule contains extension nodes? Should we traverse the whole schedule tree and check types of every node? |
Hi Roman,
thanks for this update:
Right. Extension nodes are complicated. In the end, we would probably need to import a full schedule tree to describe their positions correctly. However, for the first tests it's enough to just assume all new statements are inserted at the very top of the schedule tree. Your actual transformation will insert them somewhere else, but that's how it is.
There is also a problem related to exporting CopyStmts. We use ScopStmt::getScheduleStr() to export a schedule of a statement. It calls ScopStmt::getSchedule that calls Scop::getSchedule, which calls isl_schedule_get_map that causes:
/home/roman/Documents/polly/llvm/tools/polly/lib/External/isl/isl_schedule_tree.c:1820: cannot construct subtree schedule of tree with extension nodes
For the same reason we can't use ScopInfo's to check the work of the IslScheduleOptimizer pass at the polyhedral level in this case (that's why I temporary use the previous variant of the test)
We could make ScopStmt::getScheduleStr() return an empty string in case a schedule tree has extension nodes. It doesn't solve a problem related to export of CopyStmts. However, it could probably help to build a test case.
Yes, we can have it return an empty string. There are a couple of cases where a schedule tree with extension nodes cannot be moveled with a flat schedule. We probably need to handle them conservatively. (IslAst also uses getSchedule. In this case, we can probably replace it with getScheduleTree)
Best,
Tobias
include/polly/CodeGen/IslExprBuilder.h | ||
---|---|---|
173 ↗ | (On Diff #68816) | Drop the 'of' |
175 ↗ | (On Diff #68816) | that has -> of |
lib/Transform/ScheduleOptimizer.cpp | ||
1211 ↗ | (On Diff #68816) | Yes, this is probably the only choice. isl has some functions that allow to traverse the schedule tree. You could use these. |
Hi Tobias,
thank you for the comments!
Right. Extension nodes are complicated. In the end, we would probably need to import a full schedule tree to describe their positions correctly. However, for the first tests it's enough to just assume all new statements are inserted at the very top of the schedule tree. Your actual transformation will insert them somewhere else, but that's how it is.
I'll try to implement it.
There is also a problem related to exporting CopyStmts. We use ScopStmt::getScheduleStr() to export a schedule of a statement. It calls ScopStmt::getSchedule that calls Scop::getSchedule, which calls isl_schedule_get_map that causes:
/home/roman/Documents/polly/llvm/tools/polly/lib/External/isl/isl_schedule_tree.c:1820: cannot construct subtree schedule of tree with extension nodes
For the same reason we can't use ScopInfo's to check the work of the IslScheduleOptimizer pass at the polyhedral level in this case (that's why I temporary use the previous variant of the test)
We could make ScopStmt::getScheduleStr() return an empty string in case a schedule tree has extension nodes. It doesn't solve a problem related to export of CopyStmts. However, it could probably help to build a test case.
Yes, we can have it return an empty string. There are a couple of cases where a schedule tree with extension nodes cannot be moveled with a flat schedule. We probably need to handle them conservatively. (IslAst also uses getSchedule. In this case, we can probably replace it with getScheduleTree)
I've tried to implement it in this version of a patch.
Hi Roman,
besides some minor comments, this now looks good to me. Thank you for your work.
Best,
Tobias
include/polly/ScopInfo.h | ||
---|---|---|
2324 ↗ | (On Diff #69482) | Unnecessary line break. |
2360 ↗ | (On Diff #69482) | containsExtensionNodes 'Ext' is not a commonly used abbreviation, so people will have difficulties to expand them in their mind. |
lib/Analysis/DependenceInfo.cpp | ||
168 ↗ | (On Diff #69480) | require |
169 ↗ | (On Diff #69480) | require |
606 ↗ | (On Diff #69480) | require |
lib/Analysis/PolyhedralInfo.cpp | ||
138 ↗ | (On Diff #69482) | require |
lib/Analysis/ScopInfo.cpp | ||
1473 ↗ | (On Diff #69482) | Could it happen that this conflicts with a Statement that we generate for a basic block with the name "Copy"? If this is the case, maybe name it "CopyStmt_..."? Also, could you add a test case that covers this situation? |
lib/Exchange/JSONExporter.cpp | ||
293 ↗ | (On Diff #69482) | require |
lib/Transform/DeadCodeElimination.cpp | ||
96 ↗ | (On Diff #69482) | require |
lib/Transform/ScheduleOptimizer.cpp | ||
1023 ↗ | (On Diff #69482) | that make |
Hi Tobias,
thanks for the comments and review!
lib/Analysis/ScopInfo.cpp | ||
---|---|---|
1463 ↗ | (On Diff #71162) |
Yes, I think it's possible in case, for example, a basic block has a name "Copy_0".
Could we change the name of a basic block from an existing test case to do it? I've implemented it in the last version of patch. |