This is an archive of the discontinued LLVM Phabricator instance.

[LV] Sink scalar operands of predicated instructions
ClosedPublic

Authored by mssimpso on Oct 14 2016, 12:17 PM.

Details

Summary

When we predicate an instruction (div, rem, store) we place the instruction in its own basic block within the vectorized loop. If a predicated instruction has scalar operands, it's possible to recursively sink these scalar expressions into the predicated block so that they might avoid execution. This patch sinks as much scalar computation as possible into predicated blocks. We previously were able to sink such operands only if they were extractelement instructions.

For example, if we have a predicated store "a[i] = x", instead of generating:

vector.body:
  ...
  %i = add i64 %index, 1
  %p = getelementptr inbounds i32, i32* %a, i64 %i
  %x = extractelement <2 x i32> %vec, i32 0
  ...
pred.store:
  store i32 %x, i32* %p
  ...

We will now generate:

vector.body:
  ...
pred.store:
  %i = add i64 %index, 1
  %p = getelementptr inbounds i32, i32* %a, i64 %i
  %x = extractelement <2 x i32> %vec, i32 0
  store i32 %x, i32* %p
  ...

Diff Detail

Repository
rL LLVM

Event Timeline

mssimpso updated this revision to Diff 74728.Oct 14 2016, 12:17 PM
mssimpso retitled this revision from to [LV] Sink scalar operands of predicated instructions.
mssimpso updated this object.
mssimpso added reviewers: anemet, mkuper, gilr.
mssimpso added subscribers: llvm-commits, mcrosier.
mkuper edited edge metadata.Oct 14 2016, 1:33 PM

Why do this inside the vectorizer? Are we missing a later code sinking pass?

Why do this inside the vectorizer? Are we missing a later code sinking pass?

Hi Michael,

We actually do have an IR sinking pass (Transforms/Scalar/Sink.cpp), but it's not in the current pass pipeline. I've experimented with it in the past and observed some fairly large performance regressions when using it. It doesn't seem like the compiler is tuned for an "always sink where possible" pass. The sinking implemented in this patch is only aimed at the operands of the predicated instructions. It's really an extension of Gil's work for predicated instructions and all the scalarization we do now. We were previously only sinking the extractelement instructions we created when predicating, but we can do more.

The end goal is to get us closer to preserving the membership of the original predicated blocks if profitable, rather than always vectorizing and hoisting the instructions through if-conversion (aside from the stores, divs, and rems we can't vectorize). So we will need access to the cost model. The current patch sinks all the existing scalar operands, but the cost model may deem it more profitable for us to scalarize an instruction we otherwise would have vectorized, knowing that it is guaranteed to be sunk into a predicated block. I have a follow-on patch that does this.

With that in mind, my thoughts are that in the vectorizer, we (1) know the state of the program before and and after if-conversion. (i.e., was this instruction originally conditionally executed? If so, it might be good to leave it that way. Not, this instruction isn't need on all paths, so let's always sink it). And (2), we at least have some way to judge the profitability of this kind of decision.

Hope that helps (and makes sense)!

That makes sense, especially the cost angle.

On the other hand, my worry is that long-term, as this kind of thing accumulates, we may end up with an ad-hoc "organically grown" mini-optimizer inside the vectorizer. And I'd really like to try to avoid that. The fact this sinking patch requires D25631 to perform DCE inside the vectorizer only reinforces that concern.

So I'm not really sure what we should do. Adam, Gil, thoughts?

lib/Transforms/Vectorize/LoopVectorize.cpp
4279 ↗(On Diff #74728)

I think the standard way to do this is to run "while (!WorkList.empty()))" and keep a side map of nodes you're done with so you will not add to the worklist. This avoids both the iterator invalidation problem, and recognizing the fixed-point problem.
It may make things more complicated in this case, though.

4291 ↗(On Diff #74728)

Is removing from the middle of a SetVector O(N)?
I'd expect this case to happen a lot.

(I understand you're doing it to keep Idx constant while making the Worklist smaller. As above, perhaps change the structure of the loop?)

4304 ↗(On Diff #74728)

PredBB->getFirstInsertionPt() (which is essentially the same thing, but nicer. :-) )

dorit added a subscriber: dorit.Oct 18 2016, 12:43 AM
gilr edited edge metadata.Oct 18 2016, 3:47 AM

Hi Michael,

On the other hand, my worry is that long-term, as this kind of thing accumulates, we may end up with an ad-hoc "organically grown" mini-optimizer inside the vectorizer. And I'd really like to try to avoid that.

So this is a constant dilemma (e.g. in D21620). It's not just the question of whether a later pass will optimize but also of how to cost-model later hypothetical optimizations. I agree there's a slippery slope here, so we need to estimate the ROI per case.
I think this patch still falls under the "Generate no junk knowingly" category. Letting a later pass perform the actual sinking won't save us from performing the analysis in favor of the cost model's accuracy. The implementation is focused on the predicated instructions and is encapsulated in a single function, so it doesn't clutter the code (hope this will also hold for cost-model changes). I say Aye for this one.

The fact this sinking patch requires D25631 to perform DCE inside the vectorizer only reinforces that concern.

Since D25631 fixes a somewhat sloppy behavior of the vectorizer I think it also falls under "Generate no junk knowingly", so it could stand on its own as part of the vectorizer's induction variable handling logic.

Michael/Gil,

Thanks very much for the feedback and discussion! I'll go ahead and work on updating the patch according to Michael's suggestions.

That makes sense, especially the cost angle.

On the other hand, my worry is that long-term, as this kind of thing accumulates, we may end up with an ad-hoc "organically grown" mini-optimizer inside the vectorizer. And I'd really like to try to avoid that. The fact this sinking patch requires D25631 to perform DCE inside the vectorizer only reinforces that concern.

Yes, I agree a mini-optimizer inside the vectorizer is not something we want. But to clarify things somewhat, we're not really optimizing existing code as much as we are teaching the vectorizer to not emit poor code, or to "knowingly generate junk", to paraphrase Gil. In D25631, we're not actually performing DCE. In fact, by not emitting would-be dead code, we're preventing a later DCE pass from having to work as much.

We certainly shouldn't try and reinvent the wheel when it isn't necessary. If a standard pass in the pipeline can clean up after the vectorizer (and the compile time penalty from holding on to and churning over the sub-optimal code we generate is not a concern), we should let it do so, in favor of simplicity in the vectorizer. We don't have a pass that does this though (in the limited sense of the predicated instructions), and there's value in letting the cost model guide the future scalarization/sinking choices we can make.

In D25632#572696, @gilr wrote:

It's not just the question of whether a later pass will optimize but also of how to cost-model later hypothetical optimizations.

This is definitely true, but it can actually get a little worse! Later passes can use their own cost model computed over the "junk" the vectorizer generates. For example, in the LTO pipeline, loop unrolling is run after the vectorizer, but before InstCombine cleans up the code. We should probably fix this.

lib/Transforms/Vectorize/LoopVectorize.cpp
4279 ↗(On Diff #74728)

Sure, I'll restructure the loop. And you're right, removing from the middle of SetVector is not constant-time, so we should avoid that. Thanks!

4304 ↗(On Diff #74728)

Sounds good!

Ok, consider me convinced. :-)

gilr added inline comments.Oct 18 2016, 12:07 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
4308 ↗(On Diff #74728)

Double "have" in comment.

mssimpso updated this revision to Diff 75060.Oct 18 2016, 12:58 PM
mssimpso edited edge metadata.
mssimpso marked 4 inline comments as done.

Addressed comments from Michael and Gil. Thanks!

  • Restructured the loop.
  • Updated comments.
gilr accepted this revision.Oct 19 2016, 10:20 AM
gilr edited edge metadata.

LGTM with a nit (inline)

lib/Transforms/Vectorize/LoopVectorize.cpp
4299 ↗(On Diff #75060)

Nit - maybe move mayHaveSideEffects() to the end here? the other tests seem much lighter (to a lesser extent - also for Loop->contains()).

This revision is now accepted and ready to land.Oct 19 2016, 10:20 AM
mssimpso added inline comments.Oct 19 2016, 1:45 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
4299 ↗(On Diff #75060)

Sure, sounds good. I'll reorder the conditions. Thanks!

Michael, did you have any comments about the new loop structure?

mkuper accepted this revision.Oct 21 2016, 1:29 PM
mkuper edited edge metadata.

Sorry, lost track of this, thanks for pinging.
LGTM.

This revision was automatically updated to reflect the committed changes.