Page MenuHomePhabricator

[LSR] Generate formulae to enable more indexed accesses
ClosedPublic

Authored by samparker on Dec 6 2018, 8:08 AM.

Details

Summary

Modify GenerateConstantOffsetsImpl to create offsets that can be used by indexed addressing modes. If formulae can be generated which result in the constant offset being the same size as the recurrence, we can generate an indexed access.

The resulting code, at least for Arm, is that usually pre-indexed loads are used as the last access, but sometimes the first.

@kparzysz Would you be able to provide feedback on how this effects Hexagon? It's a target that I don't build and I haven't looked at the tests, but I'm assuming this would interest you.

Diff Detail

Repository
rL LLVM

Event Timeline

samparker created this revision.Dec 6 2018, 8:08 AM
gilr added inline comments.Dec 9 2018, 8:49 AM
test/CodeGen/ARM/loop-align-cortex-m.ll
3 ↗(On Diff #176977)

You're just fixing this by the way, right?

test/Transforms/LoopStrengthReduce/ARM/complexity.ll
10 ↗(On Diff #176977)

Shouldn't -DEFAULT be removed here too?

samparker marked 2 inline comments as done.Dec 11 2018, 12:51 AM
samparker added inline comments.
test/CodeGen/ARM/loop-align-cortex-m.ll
3 ↗(On Diff #176977)

yes

test/Transforms/LoopStrengthReduce/ARM/complexity.ll
10 ↗(On Diff #176977)

thanks!

I've run some tests and the results are not great for us. On some tests we got up to 5.5% improvements, but there are a lot of severe degradations (15+% worse). If this patch goes in, we'd like to be able to opt out.

Okay, thanks. We're also seeing some regressions, so I know I've got some tuning to do. Do you have any idea of the characteristics of your regressions? At the moment I'm thinking:

  • That the costs that I've added here are overly simplistic, for one I think I need to add a setup cost.
  • It's also probably not worth doing when we know that the loop iteration count is low.
  • In the current state, we also see code size regressions whereas your previous work helps us reduce code size. It may mean that I'll need a different flag to enable this change, but it also maybe a symptom of the performance regressions.
gilr added inline comments.Dec 12 2018, 12:13 AM
lib/Transforms/Scalar/LoopStrengthReduce.cpp
3793 ↗(On Diff #176977)

Is this relevant for non-Address kind formulae?

3793 ↗(On Diff #176977)

Worth a comment here to describe the motivation and how adjusting the offset generates the post-inc opportunity.

3795 ↗(On Diff #176977)

How does a non-constant step (e.g. 50 + %x) translate to post-inc? Could you add such a test case?

samparker marked 2 inline comments as done.Dec 12 2018, 1:09 AM
samparker added inline comments.
lib/Transforms/Scalar/LoopStrengthReduce.cpp
3793 ↗(On Diff #176977)

No. I had naively assumed that this function was only generating formulae for addresses... so if that's not the case, I'll make the change here.

3795 ↗(On Diff #176977)

Ah, good catch. This should only trigger for constant steps.

samparker updated this revision to Diff 177872.Dec 12 2018, 9:53 AM

I've moved the logic under the control of a new TTI flag as it seems that the current shouldFavourPostInc is trying to achieve different things. Hopefully I've also addressed Gil's comments.

gilr added inline comments.Dec 16 2018, 5:56 AM
lib/Transforms/Scalar/LoopStrengthReduce.cpp
1258 ↗(On Diff #177872)
  • The existing {LI, +, C} pattern seems to already match this case (i.e. {(-C + %a + ...), +, C}) and the more general case {(Offset - C2) + %a + ...), +, C2}, where Offset =/= 0. So IIUC matching this pattern here is only needed if the new TTI flag is set but the existing one is reset, right?
  • Won't this also match something like {3, +, 5, +, %x}?
1384 ↗(On Diff #177872)

IIRC single-statement clauses shouldn't get curly braces.

3802 ↗(On Diff #177872)

The new TTI API relates to the same HW feature, right? Why not use a cl::opt in LSR to turn this optimization off?

samparker marked an inline comment as done.Dec 19 2018, 1:27 AM
samparker added inline comments.
lib/Transforms/Scalar/LoopStrengthReduce.cpp
1258 ↗(On Diff #177872)

I'm going to need to replace this with something that compares the step with the base offset, aiming to have the post increment happen on the last, not the first, access. I need to make a few other changes to support this though as well.

samparker marked an inline comment as done.Dec 19 2018, 4:49 AM
samparker added inline comments.
lib/Transforms/Scalar/LoopStrengthReduce.cpp
3802 ↗(On Diff #177872)

When I've got the next patch ready, I will re-run my tests and see if this is viable. It may not be good for all targets, and if so, I think a target hook would be more useful.

samparker updated this revision to Diff 181251.EditedJan 11 2019, 5:02 AM
  • Added a command line option to 'EnableBackedgePostIncs'.
  • Added a command line option to enable narrowing the search space by collapsing unrolled code.
  • The TTI hook now accepts the loop so that the target can make a more informed decision on when it 'shouldFavorBackedgePostIncs'.
  • LSRInstance contains a boolean 'FavorBackedgePostInc' which is equal to EnableBackedgePostIncs && shouldFavorBackedgePostIncs.
  • When FavorBackedgePostIncs:
    • Generate the new constant offsets.
    • In RateRegister, the LoopCost is now set to 0 if the step recurrence is equal to the base offset of the parent formula.
    • IsProfitableChain has a higher limit
    • The last expression in the IVChain is not added to IVIncSet so it's a target for optimising.
samparker retitled this revision from [LSR] Generate formulae to enable more post-incs to [LSR] Generate formulae to enable more indexed accesses.Jan 11 2019, 7:22 AM
samparker edited the summary of this revision. (Show Details)
samparker updated this revision to Diff 181270.Jan 11 2019, 7:42 AM

some renaming, renamed post inc bits to 'index'.

samparker updated this revision to Diff 181297.Jan 11 2019, 9:22 AM

Changed the default value for the command line option 'EnableBackedgeIndexing' to false.

samparker updated this revision to Diff 181776.Jan 15 2019, 6:44 AM

Removed changes to the complexity.ll test.

ping

Are you still seeing regressions? Can you provide performance data with the current patch?

Hi Hal,

I've attached a graph to show geomean performance results of a popular embedded benchmark suite running on Arm microcontroller. The suite is broken into five sub-suites and only three of which are affected by these changes. The large regression comes from a test which often exhibits bi-modal behaviour, so I could be getting unlucky. The whole suite comprises of several dozen tests and I only four of the benchmarks have minor regressions across the three runs. There's also some backend work that I need to do for Arm to get this optimised properly.

gilr added inline comments.Jan 28 2019, 5:52 AM
lib/Transforms/Scalar/LoopStrengthReduce.cpp
159 ↗(On Diff #181776)

False by default?

4466 ↗(On Diff #181776)

The convention for flags controlling the narrowing heuristics seem to be to use them in NarrowSearchSpaceUsingHeuristics() rather than in the the functions they affect.

4466 ↗(On Diff #181776)

I think you meant '||' here.

samparker marked an inline comment as done.Jan 29 2019, 1:20 AM
samparker added inline comments.
lib/Transforms/Scalar/LoopStrengthReduce.cpp
159 ↗(On Diff #181776)

This option has been introduced to force the collapsing, even if the complexity limit hasn't been reached. Which is why it's implemented differently to the other, similar, options.

gilr added inline comments.Feb 3 2019, 2:27 PM
lib/Transforms/Scalar/LoopStrengthReduce.cpp
159 ↗(On Diff #181776)

Aaargh, sorry, read it backwards ...
Is it needed since unrolled code doesn't initially go into the same LSRUse? If so, can FavorBackedgeIndex be used to have initial construction put them in the same LSRUse?

2904 ↗(On Diff #181776)

Deserves a comment.

3121 ↗(On Diff #181776)

I'm a bit confused here. IIUC a profitable complete chain prvides an efficient solution using post-increments. Why break the last increment out?

samparker marked 2 inline comments as done.Feb 4 2019, 3:36 AM
samparker added inline comments.
lib/Transforms/Scalar/LoopStrengthReduce.cpp
159 ↗(On Diff #181776)

I will look into this, thanks.

3121 ↗(On Diff #181776)

This was so that the last access could also be a target for optimisation. I'll remove the change and do some testing again to see the difference.

samparker updated this revision to Diff 185268.Feb 5 2019, 3:13 AM

Made some simplifications:

  • reset isProfitableChain.
  • reset FinalizeChain so that the tail is added to the chain again.
  • removed the CollapseUnrolled option because with the reset in changes, it wasn't really interesting.

I'll post the updated performance numbers.

gilr added inline comments.Feb 6 2019, 12:40 PM
include/llvm/Analysis/TargetTransformInfo.h
489 ↗(On Diff #185268)

Please add a doxygen comment.

lib/Target/ARM/ARMTargetTransformInfo.h
97 ↗(On Diff #185268)

Is this optimization inherently code-size unfriendly for ARM? (The patch actually reduces the instruction count in LSR's Cost when this optimization kicks in)

99 ↗(On Diff #185268)

Is the single-block constraint due to CodeGen's single-block optimization scope? (If so, then IINM it's not target-specific)

samparker marked 2 inline comments as done.Feb 7 2019, 1:35 AM
samparker added inline comments.
lib/Target/ARM/ARMTargetTransformInfo.h
97 ↗(On Diff #185268)

There's two reasons really: the transform is most useful in 'unrolled' loops (which we disable when optimising for code size), and this transform will introduce instructions into the preheader and if the address can't be kept in the same register, we'll also produce moves. So this is mainly a defensive restriction, because I haven't been tracking code size, but I would hope that I can remove the restriction later.

99 ↗(On Diff #185268)

No, its not because of ISel restrictions or anything like that. It's because the transform is only likely to be useful is the address can be kept in the same register - which becomes increasingly less likely once multiple blocks are considered.

samparker updated this revision to Diff 185722.Feb 7 2019, 1:40 AM

Added doxygen comment.

gilr added inline comments.Feb 7 2019, 2:26 AM
lib/Target/ARM/ARMTargetTransformInfo.h
99 ↗(On Diff #185268)

So both the code-size and single-block heuristics seem target-independent. Why not do this in LSR?

samparker marked an inline comment as done.Feb 7 2019, 2:58 AM
samparker added inline comments.
lib/Target/ARM/ARMTargetTransformInfo.h
99 ↗(On Diff #185268)

I'd argue because different backends would come to different conclusions to me. I've gone for a very simplistic heuristic, but it would be good to consider register pressure rather than just the number of blocks. Also, some targets may not support indexed accesses on certain types of memory operations and so it worthless generating formulae for them.

gilr accepted this revision.Feb 7 2019, 4:28 AM

LGTM!
(One last nitpick: I'd consider letting EnableBackedgeIndexing default to true as TTI's default already disables it for Hexagon and all other targets)

This revision is now accepted and ready to land.Feb 7 2019, 4:28 AM

Will do. Thanks for the review!

Closed by commit rL353403: [LSR] Generate cross iteration indexes (authored by sam_parker, committed by ). · Explain WhyFeb 7 2019, 5:33 AM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptFeb 7 2019, 5:33 AM