The tree height reduction optimization increases instruction-level parallelism by reordering the operations in a loop to keep the operation tree as shallow as possible.
For example, see the following code.
for (int i = 0; i < N; ++i) x[i] = a0[i] + a1[i] + a2[i] + a3[i] + a4[i] + a5[i] + a6[i] + a7[i];
This is equivalent to the following left-associated code, in which each addition depends on the result of the preceding one.
for (int i = 0; i < N; ++i) x[i] = ((((((a0[i] + a1[i]) + a2[i]) + a3[i]) + a4[i]) + a5[i]) + a6[i]) + a7[i];
Tree height reduction transforms it into the following code. The additions in the innermost parentheses can be executed in parallel.
for (int i = 0; i < N; ++i) x[i] = ((a0[i] + a1[i]) + (a2[i] + a3[i])) + ((a4[i] + a5[i]) + (a6[i] + a7[i]));
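To make the reduced height visible, here is a sketch of the same reassociated expression written with explicit temporaries (the temporary names t01, t23, ... are illustrative only, and int element types are assumed as in the example above). The dependency chain shrinks from 7 sequential additions to 3 levels.

for (int i = 0; i < N; ++i) {
  int t01 = a0[i] + a1[i];  /* level 1: these four additions are independent */
  int t23 = a2[i] + a3[i];
  int t45 = a4[i] + a5[i];
  int t67 = a6[i] + a7[i];
  int t0123 = t01 + t23;    /* level 2: these two additions are independent */
  int t4567 = t45 + t67;
  x[i] = t0123 + t4567;     /* level 3: the final addition */
}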
The implemented algorithm is based on the following paper:
- Katherine Coons, Warren Hunt, Bertrand A. Maher, Doug Burger, Kathryn S. McKinley. Optimal Huffman Tree-Height Reduction for Instruction-Level Parallelism.
Applicable conditions
This patch adds the tree height reduction pass to the default optimization pipeline, but it is disabled by default. You need the -enable-int-thr option for integer operations (add, mul, and, or, and xor) and the -enable-fp-thr option for floating-point operations (fadd and fmul) to enable it. Furthermore, for floating-point operations, the fadd/fmul instructions must also carry the reassoc and nsz fast-math flags.
You can use it via Clang by:
clang -O1 -fassociative-math -fno-signed-zeros -mllvm -enable-int-thr -mllvm -enable-fp-thr
Or, simply:
clang -Ofast -mllvm -enable-int-thr -mllvm -enable-fp-thr
It is only applied to operations in innermost loops.
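For illustration, here is a hypothetical floating-point candidate (the function and array names are made up, not taken from the patch). With -fassociative-math and -fno-signed-zeros the fadds carry the reassoc and nsz flags, so the sum chain in the innermost loop becomes eligible, for example rewriting it as (b0[j] + b1[j]) + (b2[j] + b3[j]); the outer loop itself is not considered.

void sum_rows(int n, int m, float *restrict x,
              const float *restrict b0, const float *restrict b1,
              const float *restrict b2, const float *restrict b3) {
  for (int i = 0; i < n; ++i)      /* outer loop: not considered */
    for (int j = 0; j < m; ++j)    /* innermost loop: candidate for the pass */
      x[i * m + j] = b0[j] + b1[j] + b2[j] + b3[j];
}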
Performance
I ran the C/C++ benchmarks in SPECspeed 2017 on a Fujitsu A64FX processor, which has two pipelines each for integer operations and for SIMD/FP operations. 600.perlbench_s and 619.lbm_s showed a 3% improvement. The other benchmarks (602, 605, 620, 623, 625, 631, 641, 644, 657) were within ±1%. In these runs, the number of OpenMP threads was limited to one to emphasize the effect of the optimization.
Relation to D67383
This patch is an updated version of D67383. Its author, @masakazu.ueno, was my colleague, and I have taken over his patch.
The following comments on D67383 are addressed in this patch:
- Support bitwise instructions
- Use llvm::stable_sort
- Correct tests
- Add TTI comment
- Correct misleading comment
In addition, this patch includes the following updates:
- Adapt to the latest main branch (new pass manager, opaque pointer, Apache license, ...)
- Remove the requirement of full fast-math flags
- Fix bugs
- Simplify the code
- etc.
Future work
Cost estimation is not implemented yet. As @hfinkel pointed out, this optimization has no benefit if the target processor cannot exploit the additional ILP. I would like to add some form of cost estimation and enable the optimization automatically when it is profitable.
I would also like to add a Clang option to enable/disable it.
These will be addressed in another patch.
Discussion
I would like comments on the following points.
- This optimization is only applied to instructions in innermost loops because D67383 was implemented that way. Should we extend it to instructions in all basic blocks?
- Where is the best place for this pass in the default optimization pipeline? I put it after the loop unrolling pass because, if a loop contains a reduction, the unrolled reduction chain can also be optimized by this pass (see the sketch after this list).
- As explained in the future work above, I want to add a decision about whether to apply the optimization. What information should this decision be based on? The issue width in a machine model?
- For the root of the tree, we may not need the isBranchCandidate check.
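As a sketch of the interaction with unrolling mentioned above (the loop below is illustrative, not taken from a benchmark): after a reduction loop is unrolled, the serial chain of additions into s is exactly the kind of expression the pass can rebalance.

int s = 0;
for (int i = 0; i < N; ++i)
  s += a[i];

/* After unrolling by 4 (remainder handling omitted), each iteration computes
     s = (((s + a[i]) + a[i+1]) + a[i+2]) + a[i+3];
   which tree height reduction could rebalance into, e.g.,
     s = (s + a[i]) + ((a[i+1] + a[i+2]) + a[i+3]);
   shortening the dependence chain from 4 additions to 3. */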