This is the first patch to apply the BLIS matmul optimization pattern on matmul kernels (http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf). BLIS implements gemm as three nested loops around a macro-kernel, plus two packing routines. The macro-kernel is implemented in terms of two additional loops around a micro-kernel. The micro-kernel is a loop around a rank-1 (i.e., outer product) update. In this change we create the BLIS micro-kernel by applying a combination of tiling and unrolling. In subsequent changes we will add the extraction of the BLIS macro-kernel and implement the packing transformation.
Details
Diff Detail
Event Timeline
I haven't yet found out how to get throughput of vector instructions per clock cycle and latency of instructions That's why these values are passed as command line parameters. Maybe TargetTransformInfo::getArithmeticInstrCost can be used for this purpose. However, I haven't found an algorithm that is used by target architectures to compute a cost in TargetTransformInfoImpl.
Hi Roman,
the patch looks generally good. However, I have a couple of smaller comments.
Some comments to the commit message:
- Does this already give some speedups for your kernel. In case it does, can you state the improvement in the commit message?
- You already give an overview over what BLIS does in the commit message. Can you explicitly state which part this patch implements (you do this partially) and what the subsequent steps are. Something like:
This is the first patch to apply the BLIS matmul optimization pattern on matmul kernels (URL/reference). BLIS implements gemm as three nested loops around a macro-kernel, plus two packing routines. The macro-kernel is implemented in terms of two additional loops around a micro-kernel. The micro-kernel is a loop around a rank-1 (i.e., outer product) update. In this change we create the BLIS micro-kernel by applying a combination of tiling and unrolling. In subsequent changes we will .....
include/polly/ScheduleOptimizer.h | ||
---|---|---|
106 | The first time I read this function name I got the impression you want to register a tile node. Maybe a name such as applyRegisterTiling() could help to avoid such issues. | |
109 | that is used where? | |
111 | Why Node is repeated here? | |
114 | I would personally call this 'optimizeMatMulPattern'. | |
lib/Transform/ScheduleOptimizer.cpp | ||
124 | The option names are very cryptic. Can you spell them out to make them more understandable. Also, maybe add a prefix such as -polly-target-latency-vector-fma? And below -polly-target-througput-vector-fma (if these are the correct names)? Also, please use more descriptive variable names. | |
125 | The minimal number of cycles between issuing two dependent .consecutive vector fused multiply-add instructions. Also you repeat here instructions twice. | |
134 | The second part does not seem grammatically correct. | |
362 | Outlining this function is a preparing transformation. I would suggest to commit this separately as NFC cleanup in preparation of this overall commit No additional pre-commit review needed for such a change. | |
404 | Nice cleanup. | |
510 | This clearly would benefit from a longer comment at the top of this function definition that describes what we are doing here, where we got the ideas from and where the cost functions are derived from. |
Hi Tobias,
thank you for the comments! I’ve tried to address them in this version of the patch.
P.S.: I haven’t got the results of the nightly tests yet. However, I get the following numbers, if I try to compile gems:
clang -O3 gemm.c -I utilities/ utilities/polybench.c -DPOLYBENCH_TIME -DPOLYBENCH_USE_SCALAR_LB -march=native -Xclang -load -Xclang /tmp_home/compiled/llvm_d/lib/LLVMPolly.so -mllvm -polly -mllvm -polly-pattern-matching-based-opts=false -mllvm -polly-target-latency-vector-fma=8 -mllvm -polly-target-througput-vector-fma=1
0.750034
clang -O3 gemm.c -I utilities/ utilities/polybench.c -DPOLYBENCH_TIME -DPOLYBENCH_USE_SCALAR_LB -march=native -Xclang -load -Xclang /tmp_home/compiled/llvm_d/lib/LLVMPolly.so -mllvm -polly -mllvm -polly-pattern-matching-based-opts=true -mllvm -polly-target-latency-vector-fma=8 -mllvm -polly-target-througput-vector-fma=1
0.236387
Hi Tobias,
in case of perf-x86_64-penryn-O3-polly (http://gcc12.fsffrance.org:8011/builders/perf-x86_64-penryn-O3-polly) we doesn't get speedups for the matmul kernels:
Performance-Regressions-Compile-Time | Δ | Previous | Current | σ |
MultiSource/Benchmarks/VersaBench/bmm/bmm | 32.14% | 0.2240 | 0.2960 | 0.0060 |
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/2mm/2mm | 21.87% | 0.5120 | 0.6240 | 0.0048 |
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/3mm/3mm | 20.29% | 0.8280 | 0.9960 | 0.0062 |
SingleSource/Benchmarks/Polybench/datamining/covariance/covariance | 14.05% | 0.4840 | 0.5520 | 0.0048 |
SingleSource/Benchmarks/Polybench/datamining/correlation/correlation | 14.04% | 0.4560 | 0.5200 | 0.0053 |
SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gemm/gemm | 12.50% | 0.2880 | 0.3240 | 0.0070 |
Performance Regressions - Execution Time | Δ | Previous | Current | σ |
SingleSource/Benchmarks/Polybench/datamining/correlation/correlation | 38.42% | 0.7600 | 1.0520 | 0.0047 |
SingleSource/Benchmarks/Polybench/datamining/covariance/covariance | 34.21% | 0.7600 | 1.0200 | 0.0068 |
MultiSource/Benchmarks/SciMark2-C/scimark2 | 2.48% | 63.1080 | 64.6720 | 0.0120 |
Performance-Improvements-Execution-Time | Δ | Previous | Current | σ |
MultiSource/Benchmarks/VersaBench/bmm/bmm | -18.69% | 2.2040 | 1.7920 | 0.0150 |
Maybe we should try to specify values of LatencyVectorFma and ThrougputVectorFma. Could you please advise me where I can find parameters of perf-x86_64-penryn-O3-polly? I think that the model name of its processor would be enough to determine LatencyVectorFma and ThrougputVectorFma.
I've also found out that we get compile time errors of SingleSource/Benchmarks/Polybench/linear-algebra/solvers/lu/lu and SingleSource/Benchmarks/Polybench/stencils/adi/adi caused by the following failed assertion:
assert(isl_map_dim(NewPartialSchedule, isl_dim_out) == 3 && "Each schedule dimension should be represented by a union piecewise quasi-affine expression.");
Sorry that I didn't test it on a debug build. I'll try to fix the issue soon.
The first time I read this function name I got the impression you want to register a tile node. Maybe a name such as applyRegisterTiling() could help to avoid such issues.