This is an archive of the discontinued LLVM Phabricator instance.

Add straight-line strength reduction to LLVM
ClosedPublic

Authored by jingyue on Jan 30 2015, 5:13 PM.

Details

Summary

Straight-line strength reduction (SLSR) is implemented in GCC but not yet in
LLVM. It has proven to effectively simplify statements derived from an unrolled
loop, and can potentially benefit many other cases too. For example,

LLVM unrolls

#pragma unroll
foo (int i = 0; i < 3; ++i) {
  sum += foo((b + i) * s);
}

into

sum += foo(b * s);
sum += foo((b + 1) * s);
sum += foo((b + 2) * s);

However, no optimizations yet reduce the internal redundancy of the three
expressions:

b * s
(b + 1) * s
(b + 2) * s

With SLSR, LLVM can optimize these three expressions into:

t1 = b * s
t2 = t1 + s
t3 = t2 + s

This commit is only an initial step towards implementing a series of such
optimizations. I will implement more (see TODO in the file commentary) in the
near future. This optimization is enabled for the NVPTX backend for now.
However, I am more than happy to push it to the standard optimization pipeline
after more thorough performance tests.

Diff Detail

Event Timeline

jingyue updated this revision to Diff 19075.Jan 30 2015, 5:13 PM
jingyue retitled this revision from to Add straight-line strength reduction to LLVM.
jingyue updated this object.
jingyue edited the test plan for this revision. (Show Details)
jingyue added a subscriber: Unknown Object (MLST).
hfinkel edited edge metadata.Jan 31 2015, 12:37 AM

Thanks for working on this! We need to be careful here, however, because:

  1. We don't want to turn "free" computations into non-free computations (because we remove the ability to do address-mode folding). LoopStrengthReduce uses part of the TTI interface for dealing with target addressing modes, and we should do that here too.
  2. We don't want to lengthen the critical path by unnecessarily decreasing the amount of available ILP. We don't currently have a good interface for asking a target how much ILP is available for simple integer operations (we have one for vectorization interleaving factors, which is similar in spirit, but on several targets with which I work, use a different number than would be appropriate here).

Regarding (2), we could also decide the long chain is the canonical form, and targets should split these late for ILP. I've not given this a lot of thought, and so I'm not sure how easy or hard that might be. Using the MachineCombiner might be a good place to do such splitting.

What do you think?

lib/Transforms/Scalar/StraightLineStrengthReduce.cpp
11

We don't need to mention GCC here.

182

You can remove the "TODO" part of this. There are a lot of things in LLVM that only handle the canonical operand ordering.

210

If (!C.Ins->getParent()) is fine. Also no { } needed.

Hal, thanks for the comments!

(1) Given the cases I consider so far, I am not aware of any target that can fold (B + i) * S where B and S are variables into an addressing mode. I don't think such form can even fit into LLVM's AddrMode struct. However, once we start considering more cases (e.g., B + i * S), we should definitely check for addressing modes. Does this make sense?

(2) Thanks for pointing this out. I hadn't considered this issue when preparing this diff. I now understand your concern, but don't have a good solution yet. I'll investigate how LSR handles this issue and how MachineCombiner works to get a better picture. Any suggestions from other folks are appreciated!

Jingyue

atrick accepted this revision.Feb 2 2015, 2:56 PM
atrick edited edge metadata.

Mainly, I would like to reiterate both of Hal's very good points above.

Out of curiosity, can you explain what sort of real-world source code results in this IR pattern?

You should keep in mind, that this level of optimization is currently the job of LSR. If you're doing this within a loop the IVChain part of LSR will try to do something similar.

I only reviewed it quickly, but your implementation looks good. Thank you for putting a bound on the number of candidate checks. I hate finding those quadratic behaviors later when compile time blows up.

LGTM given that it's not enabled in the standard pipeline yet.

This revision is now accepted and ready to land.Feb 2 2015, 2:56 PM

Hi Andrew,

The code example in my file commentary

#pragma unroll
foo (int i = 0; i < 3; ++i) {
  sum += foo((b + i) * s);
}

pretty much reflects real-world source code. More often, I see (b + i) * s used as an array index which creates the same issue.

LSR cannot yet solve the problem I'm facing here due to phase ordering. Loop unrolling, which happens way before LSR, destroys the loop structure. I agree with you that, if the loop is not unrolled, LSR should take care of this pattern.

meheff edited edge metadata.EditedFeb 2 2015, 4:26 PM

LGTM

lib/Transforms/Scalar/StraightLineStrengthReduce.cpp
166

Since there can be more than one candidate would there ever be any advantage to looking deeper in the list? Some bases could be cheaper (eg, where i' - i == 1 and you don't need a multiply). However, I'd guess in the common unrolled case the latest candidate on the list is best. On the other hand though, from an ILP perspective the earliest candidate might be best.

jingyue updated this revision to Diff 19211.Feb 2 2015, 10:15 PM
jingyue edited edge metadata.

Disable SLSR if the target schedules for ILP.

Also, added one more test where the bump is a multiple of S.

atrick requested changes to this revision.Feb 2 2015, 10:20 PM
atrick edited edge metadata.

I don't like piggybacking on the scheduling mode. When a target selects the scheduler it doesn't expect to get different IR out of the optimizer. Don't surprise users.

It would be fine to add a separate hook for this pass. But I don't think it's an issue until the pass goes into the standard pipeline.

This revision now requires changes to proceed.Feb 2 2015, 10:20 PM
jingyue added inline comments.Feb 2 2015, 10:25 PM
lib/Transforms/Scalar/StraightLineStrengthReduce.cpp
11

done

166

We may want to improve the strategy in the future. The patterns I have seen so far are derived from loop unrolling, and in those cases choosing the latest candidate very likely produces the best code. Also note that in the cases where the value grows by a multiple of the stride, e.g.,

b * s
(b + 2) * s
(b + 4) * s

choosing the latest candidate leads to a uniform bump (2 * s in this example) for every iteration, and this bump can be GVN'ed later.

182

done

210

done

jingyue updated this revision to Diff 19212.Feb 2 2015, 10:42 PM
jingyue edited edge metadata.

stop piggybacking scheduling info

Andrew, thanks for pointing this out! I reverted my change on using scheduling info, and left a TODO in the file comment.

atrick accepted this revision.Feb 2 2015, 10:44 PM
atrick edited edge metadata.

lgtm

This revision is now accepted and ready to land.Feb 2 2015, 10:44 PM
jholewinski accepted this revision.Feb 3 2015, 6:57 AM
jholewinski edited edge metadata.

LGTM! Thanks for working on this!

jingyue updated this revision to Diff 19252.Feb 3 2015, 11:25 AM
jingyue edited edge metadata.

Minor changes, NFC:
Merge from upstream
Do not put usernames in TODO

Hal, do you think this patch is good to go, given we don't push it to the standard pipeline yet? I left a TODO on the ILP issue you raised. I like your point a lot, but I feel I should do it in a separate diff.

Hal, do you think this patch is good to go, given we don't push it to the standard pipeline yet? I left a TODO on the ILP issue you raised. I like your point a lot, but I feel I should do it in a separate diff.

Yes, that's fine with me. Let's iterate in-tree. Please proceed.

jingyue closed this revision.Feb 3 2015, 11:38 AM