This is an archive of the discontinued LLVM Phabricator instance.

[Polly] Change the loop order of micro and macro kernels
ClosedPublic

Authored by gareevroman on Oct 16 2016, 12:59 AM.

Details

Summary

The order of the loops defines the data reused in the BLIS implementation of gemm ([1]). In particular, elements of the matrix B, the second operand of matrix multiplication, are reused between iterations of the innermost loop. To keep the reused data in cache, only elements of matrix A, the first operand of matrix multiplication, should be evicted during an iteration of the innermost loop. To provide such a cache replacement policy, elements of the matrix A can, in particular, be loaded first and, consequently, be least-recently-used.

In our case matrices are stored in row-major order instead of column-major order used in the BLIS implementation ([1]). One of the ways to address it is to accordingly change the order of the loops of the loop nest. However, it makes elements of the matrix A to be reused in the innermost loop and, consequently, requires to load elements of the matrix B first. Since the LLVM vectorizer always generates loads from the matrix A before loads from the matrix B and we can not provide it. Consequently, we only change the BLIS micro kernel and the computation of its parameters instead. In particular, reused elements of the matrix B are successively multiplied by specific elements of the matrix A .

Refs.:
[1] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf

Diff Detail

Repository
rL LLVM

Event Timeline

gareevroman retitled this revision from to [Polly] Change the loop order of micro and macro kernels.
gareevroman updated this object.
gareevroman added a subscriber: pollydev.
gareevroman updated this object.Oct 16 2016, 1:02 AM
grosser edited edge metadata.Oct 17 2016, 10:15 PM

Hi Roman,

the general direction looks fine to me. Two questions:

  • Can you add comments that explain why this interchange is done?
  • How does this impact performance on your system?
gareevroman updated this object.
gareevroman edited edge metadata.

Hi Roman,

the general direction looks fine to me. Two questions:

Hi Tobias,

thanks for the comments!

  • Can you add comments that explain why this interchange is done?

Maybe we could extend the comments of ScheduleTreeOptimizer::optimizeMatMulPattern. I've done it in the new version of the patch.

  • How does this impact performance on your system?

I've added this information to the summary. This patch helps to attain 40.45% of theoretical peak. However, determination of optimal values of the Nc parameter will help to attain 77,74% of theoretical peak.

Hi Roman,

sorry for the delay. I think the documentation clearly needs to be more extensive here. At the very least you should include the comment from the commit message in the source code. More importantly, it would be good to explain precisely what [1] assumes, what the LLVM vectorizer does, how this differs and with what changes to the algorithm in [1] we take care of this difference.

Best,
Tobias

gareevroman updated this object.

Hi Roman,

sorry for the delay. I think the documentation clearly needs to be more extensive here. At the very least you should include the comment from the commit message in the source code. More importantly, it would be good to explain precisely what [1] assumes, what the LLVM vectorizer does, how this differs and with what changes to the algorithm in [1] we take care of this difference.

Hi Tobias,

thanks for the comments! I've tried to address them.

grosser accepted this revision.Dec 15 2016, 3:49 AM
grosser edited edge metadata.
This revision is now accepted and ready to land.Dec 15 2016, 3:49 AM
This revision was automatically updated to reflect the committed changes.