This is an archive of the discontinued LLVM Phabricator instance.

Add new optimization pass of Tree Height Reduction
Abandoned · Public

Authored by kawashima-fj on Aug 29 2022, 12:44 AM.

Details

Summary

The tree height reduction optimization increases instruction-level parallelism by reordering the operations in a loop so that the height of the operation tree is kept as low as possible.

For example, consider the following code.

for (int i = 0; i < N; ++i)
  x[i] = a0[i] + a1[i] + a2[i] + a3[i] + a4[i] + a5[i] + a6[i] + a7[i];

This is evaluated as the following code: each addition depends on the result of the preceding addition.

for (int i = 0; i < N; ++i)
  x[i] = ((((((a0[i] + a1[i]) + a2[i]) + a3[i]) + a4[i]) + a5[i]) + a6[i]) + a7[i];

Tree height reduction transforms it into the following code, where the additions in the innermost parentheses can be executed in parallel.

for (int i = 0; i < N; ++i)
  x[i] = ((a0[i] + a1[i]) + (a2[i] + a3[i])) + ((a4[i] + a5[i]) + (a6[i] + a7[i]));

The implemented algorithm is based on the following paper:

  • Katherine Coons, Warren Hunt, Bertrand A. Maher, Doug Burger, Kathryn S. McKinley. Optimal Huffman Tree-Height Reduction for Instruction-Level Parallelism.
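
A minimal sketch of the combining rule from the paper, as I understand it (this is not the code in this patch): keep the operands in a priority queue ordered by the cycle in which they become ready, and always combine the two earliest-ready values first, so the tree stays as shallow as the operands' availability allows.

// Minimal sketch of the Huffman-style combining rule (not the code in
// this patch).  Each leaf has a "ready time"; the two earliest-ready
// values are combined first, and the result becomes ready one latency
// later.  The latency and ready times below are made up.
#include <algorithm>
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

int main() {
  const int Latency = 1;  // assumed uniform add latency
  // Ready times of the eight leaves a0[i]..a7[i]; all available at cycle 0.
  std::vector<int> Leaves = {0, 0, 0, 0, 0, 0, 0, 0};
  std::priority_queue<int, std::vector<int>, std::greater<int>> Ready(
      Leaves.begin(), Leaves.end());
  while (Ready.size() > 1) {
    int A = Ready.top(); Ready.pop();
    int B = Ready.top(); Ready.pop();
    Ready.push(std::max(A, B) + Latency);  // combine the two earliest-ready
  }
  // With eight equally-ready leaves the root is ready at cycle 3
  // (log2(8)), versus cycle 7 for the serial chain.
  std::printf("root ready at cycle %d\n", Ready.top());
  return 0;
}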

Applicable conditions

This patch incorporates the tree height reduction pass into the default optimization pipeline, but it is disabled by default. You need the -enable-int-thr option to enable it for integer operations (add, mul, and, or, and xor) and the -enable-fp-thr option for floating-point operations (fadd and fmul). Furthermore, for floating-point operations, the reassoc and nsz fast-math flags are required on the fadd/fmul instructions.

You can enable it via Clang with:

clang -O1 -fassociative-math -fno-signed-zeros -mllvm -enable-int-thr -mllvm -enable-fp-thr

Or, simply:

clang -Ofast -mllvm -enable-int-thr -mllvm -enable-fp-thr

It is only applied to operations in innermost loops.
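
As an illustration (a hypothetical variant of the integer example above, not a test from the patch), in the nested loop below only the floating-point chain in the innermost loop is a candidate, and only when it is compiled with the flags shown above, because floating-point addition is not associative without the reassoc flag.

// Hypothetical example.  Only the additions in the innermost loop are
// considered, and only when compiled with -fassociative-math
// -fno-signed-zeros (i.e. reassoc and nsz on the fadd instructions),
// because FP addition is not associative in general.
void kernel(int n, int m, float *x, float *a0, float *a1, float *a2,
            float *a3) {
  for (int j = 0; j < m; ++j)      // outer loop: not considered
    for (int i = 0; i < n; ++i)    // innermost loop: considered
      x[j * n + i] = a0[i] + a1[i] + a2[i] + a3[i];
}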

Performance

I ran the C/C++ benchmarks in SPECspeed 2017 on the Fujitsu A64FX processor, which has two pipelines each for integer operations and for SIMD/FP operations. 600.perlbench_s and 619.lbm_s showed a 3% improvement. The other benchmarks (602, 605, 620, 623, 625, 631, 641, 644, 657) were within ±1%. In these runs, the number of OpenMP threads was limited to one to emphasize the effect of the optimization.

Relation to D67383

This patch is an updated version of D67383. Its author, @masakazu.ueno, was my colleague; I took over his patch.

The following comments in D67383 are addressed in this patch.

Also, this patch includes the following updates.

  • Adapt to the latest main branch (new pass manager, opaque pointers, Apache license, ...)
  • Remove the requirement of full fast-math flags
  • Fix bugs
  • Simplify the code
  • etc.

Future work

Currently, cost estimation is not implemented. Of course, as @hfinkel said, this optimization has no positive effect if the target processor cannot exploit the ILP. I want to add some form of cost estimation and enable this optimization automatically when it is profitable.
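
As a purely hypothetical illustration of what such a decision could look like (nothing like this exists in the patch), one could compare the height of the serial chain, N-1, with the height of a balanced tree, ceil(log2(N)), and only rebalance when the target can issue more than one of these operations per cycle.

// Purely hypothetical profitability check; the patch currently has no
// cost model.  "IssueWidth" would have to come from the machine model,
// which is exactly discussion point 3 below.
#include <cmath>
#include <cstdio>

static bool isRebalanceProfitable(unsigned NumLeaves, unsigned IssueWidth) {
  if (NumLeaves < 4 || IssueWidth < 2)
    return false;  // a single-issue pipeline cannot exploit the extra ILP
  unsigned SerialHeight = NumLeaves - 1;  // height of the left-leaning chain
  unsigned BalancedHeight =
      (unsigned)std::ceil(std::log2((double)NumLeaves));
  return BalancedHeight < SerialHeight;
}

int main() {
  std::printf("%d\n", isRebalanceProfitable(8, 2));  // 1: height 3 vs. 7
  std::printf("%d\n", isRebalanceProfitable(8, 1));  // 0: cannot use the ILP
  return 0;
}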

I also want to add a Clang option to enable/disable it.

These will be addressed in another patch.

Discussion

I would like comments on the following points.

  1. This optimization is only applied to instructions in innermost loops because D67383 was implemented that way. Should we extend it to instructions in all basic blocks?
  2. Where is the best position for this pass in the default optimization pipeline? I put it after the loop unrolling pass because, if a loop contains a reduction, the unrolled reduction can also be optimized by this pass (see the sketch after this list).
  3. As explained under future work above, I want to add a decision on whether or not to apply the optimization. What information should this decision be based on? The issue width in the machine model?
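
Regarding point 2, a purely illustrative sketch (not taken from the patch or its tests): after the loop below is unrolled by, say, four, the accumulator forms a serial chain of additions that this pass could rebalance.

// Illustrative only.  After unrolling by 4, each iteration computes the
// serial chain
//   s = ((((s + a[i]) + a[i+1]) + a[i+2]) + a[i+3]);
// which could be rebalanced into, e.g.,
//   s = (s + a[i]) + ((a[i+1] + a[i+2]) + a[i+3]);
// so that some of the additions no longer wait on each other.
int sum(int n, const int *a) {
  int s = 0;
  for (int i = 0; i < n; ++i)
    s += a[i];
  return s;
}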

Diff Detail

Event Timeline

kawashima-fj created this revision. Aug 29 2022, 12:44 AM
Herald added a project: Restricted Project. Aug 29 2022, 12:44 AM
kawashima-fj requested review of this revision. Aug 29 2022, 12:44 AM

Maybe we can also add a reverse mode to help GPUs or low-end CPUs reduce register pressure.

fhahn added a comment. Aug 30 2022, 2:36 AM

Thanks for updating the patch! Have you considered implementing this as a MachineFunctionPass instead of an LLVM IR pass? Doing the transformation on MachineIR would allow for more precise cost estimates, including more accurate information about register usage, selected instructions, and processor resource usage. MachineCombiner.cpp might be an interesting example to look at for similar (although simpler) transformations with relatively accurate uarch-driven cost modeling.

It might be interesting to also run SPECrate instead of just running speed with one thread?

Have you considered implementing this as a MachineFunctionPass instead of an LLVM IR pass?

No, simply because the original patch D67383 was implemented as an LLVM IR pass and I'm not familiar with MachineFunctionPass.

Doing the transformation on MachineIR would allow for more precise cost estimates, including more accurate information about register usage, selected instructions, and processor resource usage. MachineCombiner.cpp might be an interesting example to look at for similar (although simpler) transformations with relatively accurate uarch-driven cost modeling.

Thanks for your advice. It does seem to have advantages, but porting this to a MachineFunctionPass may take me some time. I'll look into it.

It might be interesting to also run SPECrate instead of just running speed with one thread?

OK, I'll do that.

kawashima-fj abandoned this revision. Nov 16 2022, 2:32 AM

Thank you very much for the information. I found that MachineCombiner.cpp does similar transformations and that adding the missing opcodes to an existing function can achieve a similar result for AArch64. The patch is D138107; please review it. I am abandoning this patch.

@kawashima-fj, I think this is a good optimization pass and I want to integrate it into my product. Is there any license issue, since it was not merged into LLVM? Thanks

@seanxilinx No, there is no license issue. This patch was posted here under the same license as LLVM itself (Apache License v2.0 with LLVM Exceptions).

As I explained in D138107, D132828 and D138107 achieve a similar transformation. However, if the LLVM community wants both, I can continue this patch.

Thanks @kawashima-fj for your contribution.

llvm/lib/Transforms/Scalar/TreeHeightReduction.cpp:462

For the root of the tree, we may not need to check whether it is a branch candidate (isBranchCandidate) or not.