This is an archive of the discontinued LLVM Phabricator instance.

[GSoC 2016] [Polly] [WIP] Apply all necessary tilings and unrollings to get a micro-kernel
ClosedPublic

Authored by gareevroman on Jun 8 2016, 8:53 AM.

Details

Summary

This is the first patch to apply the BLIS matmul optimization pattern on matmul kernels (http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf). BLIS implements gemm as three nested loops around a macro-kernel, plus two packing routines. The macro-kernel is implemented in terms of two additional loops around a micro-kernel. The micro-kernel is a loop around a rank-1 (i.e., outer product) update. In this change we create the BLIS micro-kernel by applying a combination of tiling and unrolling. In subsequent changes we will add the extraction of the BLIS macro-kernel and implement the packing transformation.
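For orientation, the loop structure described above looks roughly like the following sketch. The tile sizes Mc, Nc, Kc, Mr, Nr and the problem sizes are hypothetical placeholders (assumed to divide evenly), and the two packing routines are elided; this is not the code the patch generates, only an illustration of the BLIS decomposition:

  // Illustrative sketch of the BLIS gemm loop structure. Tile sizes are
  // hypothetical, and packing of A and B into contiguous buffers is elided.
  enum { M = 1024, N = 1024, K = 1024 };
  enum { Mc = 256, Kc = 256, Nc = 512, Mr = 4, Nr = 8 };

  void gemm(double C[M][N], const double A[M][K], const double B[K][N]) {
    for (int jc = 0; jc < N; jc += Nc)        // three loops around
      for (int pc = 0; pc < K; pc += Kc)      // the macro-kernel
        for (int ic = 0; ic < M; ic += Mc)    // (packing elided)
          // Macro-kernel: two loops around the micro-kernel.
          for (int jr = jc; jr < jc + Nc; jr += Nr)
            for (int ir = ic; ir < ic + Mc; ir += Mr)
              // Micro-kernel: a loop around a rank-1 (outer product)
              // update; the two innermost loops are what this patch
              // tiles and fully unrolls.
              for (int p = pc; p < pc + Kc; ++p)
                for (int i = ir; i < ir + Mr; ++i)
                  for (int j = jr; j < jr + Nr; ++j)
                    C[i][j] += A[i][p] * B[p][j];
  }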

Diff Detail

Event Timeline

gareevroman updated this revision to Diff 60048.Jun 8 2016, 8:53 AM
gareevroman retitled this revision from to [GSoC 2016] [Polly] [WIP] Apply all necessary tilings and unrollings to get a micro-kernel.
gareevroman updated this object.
gareevroman added a subscriber: pollydev.

I haven't yet found out how to get the throughput of vector instructions per clock cycle and the latency of instructions. That's why these values are passed as command-line parameters. Maybe TargetTransformInfo::getArithmeticInstrCost can be used for this purpose. However, I haven't found the algorithm that target architectures use to compute a cost in TargetTransformInfoImpl.
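For what it's worth, a query along these lines is how TargetTransformInfo might be consulted; this is only a sketch (the function name getVectorFMulCost and the 4 x float type are made up for illustration), and the returned value is an abstract per-instruction cost rather than the separate latency and throughput numbers needed here:

  // Sketch only: asking TTI for the cost of a vector fmul. The result is
  // a generic cost, not a latency or a throughput value, which is why it
  // may not be sufficient for this patch.
  #include "llvm/Analysis/TargetTransformInfo.h"
  #include "llvm/IR/DerivedTypes.h"
  #include "llvm/IR/Instruction.h"
  #include "llvm/IR/LLVMContext.h"

  int getVectorFMulCost(const llvm::TargetTransformInfo &TTI,
                        llvm::LLVMContext &Ctx) {
    llvm::Type *VecTy =
        llvm::VectorType::get(llvm::Type::getFloatTy(Ctx), 4);
    return TTI.getArithmeticInstrCost(llvm::Instruction::FMul, VecTy);
  }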

grosser edited edge metadata.Jun 11 2016, 1:29 AM

Hi Roman,

the patch looks generally good. However, I have a couple of smaller comments.

Some comments to the commit message:

  • Does this already give some speedups for your kernel? In case it does, can you state the improvement in the commit message?
  • You already give an overview of what BLIS does in the commit message. Can you explicitly state which part this patch implements (you do this partially) and what the subsequent steps are? Something like:

This is the first patch to apply the BLIS matmul optimization pattern on matmul kernels (URL/reference). BLIS implements gemm as three nested loops around a macro-kernel, plus two packing routines. The macro-kernel is implemented in terms of two additional loops around a micro-kernel. The micro-kernel is a loop around a rank-1 (i.e., outer product) update. In this change we create the BLIS micro-kernel by applying a combination of tiling and unrolling. In subsequent changes we will .....

include/polly/ScheduleOptimizer.h
106

The first time I read this function name, I got the impression that you want to register a tile node. Maybe a name such as applyRegisterTiling() could help to avoid such confusion.

109

that is used where?

111

Why is Node repeated here?

114

I would personally call this 'optimizeMatMulPattern'.

lib/Transform/ScheduleOptimizer.cpp
125

The option names are very cryptic. Can you spell them out to make them more understandable? Also, maybe add a prefix such as -polly-target-latency-vector-fma?

And below -polly-target-througput-vector-fma (if these are the correct names)?

Also, please use more descriptive variable names.
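Something along these lines would match the option names used later in this thread; a minimal sketch, where the default value and the description string are illustrative, and PollyCategory is assumed to be the usual Polly option category:

  #include "llvm/Support/CommandLine.h"
  using namespace llvm;

  // Hypothetical declaration; the default of 8 and the wording of the
  // description are illustrative, not taken from the patch.
  static cl::opt<int> LatencyVectorFma(
      "polly-target-latency-vector-fma",
      cl::desc("The minimal number of cycles between issuing two "
               "dependent consecutive vector fused multiply-add "
               "instructions."),
      cl::Hidden, cl::init(8), cl::cat(PollyCategory));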

126

The minimal number of cycles between issuing two dependent consecutive vector fused multiply-add instructions.

Also, you repeat "instructions" twice here.

135

The second part does not seem grammatically correct.

363

Outlining this function is a preparatory transformation. I would suggest committing it separately as an NFC cleanup in preparation for this overall commit. No additional pre-commit review is needed for such a change.

405

Nice cleanup.

510

This clearly would benefit from a longer comment at the top of this function definition that describes what we are doing here, where we got the ideas from and where the cost functions are derived from.
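For reference, the sizing rule in the TOMS-BLIS-Analytical paper linked above requires the micro-tile to hold at least Nvec * LatencyVectorFma * ThroughputVectorFma independent accumulators, so the FMA pipelines never stall on a dependence. A sketch of one plausible derivation of Mr and Nr from that constraint (variable names and rounding choices are illustrative, not the patch's):

  #include <cmath>

  struct MicroKernelParams { int Mr, Nr; };

  // Pick a micro-tile Mr x Nr with Mr * Nr >= Nvec * Latency * Throughput,
  // keeping Nr a multiple of the vector width Nvec and the tile roughly
  // square. This mirrors the reasoning in the BLIS paper; the exact
  // rounding here is an assumption.
  MicroKernelParams computeMicroKernelParams(int Nvec, int LatencyVectorFma,
                                             int ThroughputVectorFma) {
    double Accumulators =
        (double)Nvec * LatencyVectorFma * ThroughputVectorFma;
    int Nr = (int)ceil(sqrt(Accumulators) / Nvec) * Nvec;
    int Mr = (int)ceil(Accumulators / Nr);
    return {Mr, Nr};
  }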

gareevroman updated this object.
gareevroman edited edge metadata.

Hi Tobias,

thank you for the comments! I’ve tried to address them in this version of the patch.

P.S.: I haven't got the results of the nightly tests yet. However, I get the following numbers if I try to compile gemm:

clang -O3 gemm.c -I utilities/ utilities/polybench.c -DPOLYBENCH_TIME -DPOLYBENCH_USE_SCALAR_LB -march=native -Xclang -load -Xclang /tmp_home/compiled/llvm_d/lib/LLVMPolly.so -mllvm -polly -mllvm -polly-pattern-matching-based-opts=false -mllvm -polly-target-latency-vector-fma=8 -mllvm -polly-target-througput-vector-fma=1

0.750034

clang -O3 gemm.c -I utilities/ utilities/polybench.c -DPOLYBENCH_TIME -DPOLYBENCH_USE_SCALAR_LB -march=native -Xclang -load -Xclang /tmp_home/compiled/llvm_d/lib/LLVMPolly.so -mllvm -polly -mllvm -polly-pattern-matching-based-opts=true -mllvm -polly-target-latency-vector-fma=8 -mllvm -polly-target-througput-vector-fma=1

0.236387

grosser accepted this revision.Jun 12 2016, 11:03 AM
grosser edited edge metadata.

LGTM.

lib/Transform/ScheduleOptimizer.cpp
135

The throughput of ...

This revision is now accepted and ready to land.Jun 12 2016, 11:03 AM

Hi Tobias,

in the case of perf-x86_64-penryn-O3-polly (http://gcc12.fsffrance.org:8011/builders/perf-x86_64-penryn-O3-polly) we don't get speedups for the matmul kernels:

Performance Regressions - Compile Time:

  Benchmark                                                               Δ        Previous  Current   σ
  MultiSource/Benchmarks/VersaBench/bmm/bmm                               32.14%   0.2240    0.2960    0.0060
  SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/2mm        21.87%   0.5120    0.6240    0.0048
  SingleSource/Benchmarks/Polybench/linear-algebra/kernels/3mm/3mm        20.29%   0.8280    0.9960    0.0062
  SingleSource/Benchmarks/Polybench/datamining/covariance/covariance      14.05%   0.4840    0.5520    0.0048
  SingleSource/Benchmarks/Polybench/datamining/correlation/correlation    14.04%   0.4560    0.5200    0.0053
  SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gemm/gemm      12.50%   0.2880    0.3240    0.0070

Performance Regressions - Execution Time:

  Benchmark                                                               Δ        Previous  Current   σ
  SingleSource/Benchmarks/Polybench/datamining/correlation/correlation    38.42%   0.7600    1.0520    0.0047
  SingleSource/Benchmarks/Polybench/datamining/covariance/covariance      34.21%   0.7600    1.0200    0.0068
  MultiSource/Benchmarks/SciMark2-C/scimark2                               2.48%  63.1080   64.6720    0.0120

Performance Improvements - Execution Time:

  Benchmark                                                               Δ        Previous  Current   σ
  MultiSource/Benchmarks/VersaBench/bmm/bmm                              -18.69%   2.2040    1.7920    0.0150

Maybe we should try to specify the values of LatencyVectorFma and ThrougputVectorFma. Could you please advise me where I can find the hardware parameters of perf-x86_64-penryn-O3-polly? I think the model name of its processor would be enough to determine LatencyVectorFma and ThrougputVectorFma.

I've also found out that we get compile-time errors in SingleSource/Benchmarks/Polybench/linear-algebra/solvers/lu/lu and SingleSource/Benchmarks/Polybench/stencils/adi/adi, caused by the following failed assertion:

assert(isl_map_dim(NewPartialSchedule, isl_dim_out) == 3 && "Each schedule dimension should be represented by a union piecewise quasi-affine expression.");

Sorry that I didn't test it on a debug build. I'll try to fix the issue soon.
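Until the proper fix is in, one way to avoid tripping the assertion would be to bail out of the pattern-based optimization when the partial schedule does not have exactly three output dimensions; a hypothetical sketch (the surrounding function and the returned Node are assumptions about the context of the assert):

  // Hypothetical guard in place of the assert: skip the matmul
  // optimization for schedules that are not three-dimensional.
  if (isl_map_dim(NewPartialSchedule, isl_dim_out) != 3) {
    isl_map_free(NewPartialSchedule);
    return Node; // Leave this schedule band unmodified.
  }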

This revision was automatically updated to reflect the committed changes.