This is an archive of the discontinued LLVM Phabricator instance.

Add new optimization pass of Tree-Height-Reduction
Needs RevisionPublic

Authored by masakazu.ueno on Sep 9 2019, 6:49 PM.

Details

Summary

The tree-height-reduction optimization increases the instruction-level parallelism by changing the order of calculations in a loop to keep the
calculation tree as short as possible.
For example, the following code is improved.
for (int i = 0; i < N; ++i){ a[i] = b[i] + c[i] + d[i] + e[i] + f[i] + g[i] + h[i] + k[i] ; }

  • Conditions for optimization
    • Integer arithmetic: -O1 or higher is effective
    • Floating point arithmetic -O1 or higher is effective, AND -ffast-math is effective, AND -mllvm -enable-fp-thr is effective

Diff Detail

Event Timeline

masakazu.ueno created this revision.Sep 9 2019, 6:49 PM

@masakazu.ueno not directly related to this patch, but i'm just wondering if this is something that could help with the similar pattern/question in https://bugs.llvm.org/show_bug.cgi?id=43146#c8

mgrang added inline comments.Sep 10 2019, 3:10 PM
lib/Transforms/Scalar/TreeHeightReduction.cpp
654

Please use the range-based llvm::sort function here. See https://llvm.org/docs/CodingStandards.html#beware-of-non-deterministic-sorting-order-of-equal-elements.

llvm::sort(Leaves);
hfinkel added inline comments.
lib/Transforms/Scalar/TreeHeightReduction.cpp
379

This is the only call to TTI that I see, and I don't see it don't see it documented anywhere what objective function this transformation is effectively using. Can you please add some explanation of that? Also, I'd expect to see some check on the instruction throughput -- because it might not make sense to introduce more ILP than the target can use -- and I'd expect to see some register-pressure estimate.

lkail added a subscriber: lkail.Sep 10 2019, 6:42 PM
xbolva00 added inline comments.Sep 11 2019, 3:47 AM
test/Transforms/TreeHeightReduction/AArch64/fp16-mult.c
1

This should be fp16-mult.ll, not fp16-mult.c

Please adjust tests to test only this new pass, not the -O1 pipeline.

troyj added a subscriber: troyj.Nov 18 2019, 1:47 PM
fsaintjacques requested changes to this revision.Dec 30 2019, 12:08 PM

I would appreciate if bitwise operators were supported, see my comments and example (could it be added in the unit test?)

lib/Transforms/Scalar/TreeHeightReduction.cpp
506

This should also work with bitwise instructions (AND, OR, XOR). I'm writing a small query language for bitmaps and was wondering if such re-balancing optimization existed. See example for IR which would benefit from this.

This revision now requires changes to proceed.Dec 30 2019, 12:08 PM
fsaintjacques added inline comments.Jan 9 2020, 10:07 AM
lib/Transforms/Scalar/TreeHeightReduction.cpp
73

The not is misleading as this is disabled by default.

379

@hfinkel I'd like to see this optimization in the next release. I don't have the time and knowledge to make the changes you request. Is it an acceptable solution if I disable this optimization in all levels, but make it accessible to front-end languages?

The cost estimation would be a followup task.

test/Transforms/TreeHeightReduction/AArch64/fp16-mult.c
1

I assume it's the same for all other files.

marksl added a subscriber: marksl.Jun 7 2021, 10:03 AM
fhahn added a comment.Jun 21 2021, 7:03 AM

@masakazu.ueno / @fsaintjacques are you still interested in pushing this forward? We recently had a couple of user reports where -reassoicate causes large regressions in hand-tuned code written with intrinsics, because it linearizes expressions that have been carefully written to exploit maximal parallelism, e.g. see https://godbolt.org/z/e6fWcbdjn

A general pass to reduce the height of compute expressions would certainly by very valuable for such cases.

A revised patch is posted as D132828. Please review it.

Herald added a project: Restricted Project. · View Herald TranscriptAug 29 2022, 12:47 AM