This is an archive of the discontinued LLVM Phabricator instance.

[LV] Keep predicated instructions in the same block
AbandonedPublic

Authored by mssimpso on Nov 11 2016, 12:26 PM.

Details

Reviewers
mkuper
gilr
Summary

When predicating scalar instructions, we previously placed every instruction that requires predication into its own block, regardless of whether multiple instructions may have occurred in the same block in the original loop. For example, if a division and a store from the same block in the original loop require predication, we created a new basic block for each of the 2 x VF x UF scalar instructions in the vector loop.

This patch modifies code generation for instruction predication such that we keep the predicated instructions in the same block after vectorization if they were in the same block before vectorization.

Event Timeline

mssimpso updated this revision to Diff 77651.Nov 11 2016, 12:26 PM
mssimpso retitled this revision from to [LV] Keep predicated instructions in the same block.
mssimpso updated this object.
mssimpso added reviewers: mkuper, gilr.
mssimpso added subscribers: llvm-commits, mcrosier.
gilr added inline comments.Nov 13 2016, 10:06 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
7099

Why force the creation of the edge masks here?

mkuper added inline comments.Nov 13 2016, 4:15 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
482

As long as you're changing this comment - this is no longer only for "un-vectorizable" instructions.

4319

This changes now, right?

4486

Do we generally prefer this to defining the vector inside the loop, from a performance standpoint? (The latter would be clearer, I think).

4488

What happens in the "Lane == 0 && Legal->isUniformAfterVectorization(I)" case?
I'm having a bit of trouble imagining what the code ends up looking.

mssimpso added inline comments.Nov 14 2016, 12:07 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
482

That's true! I'll update the comment.

4319

That's right; I missed this. I'll update this comment as well.

4486

Not that I'm aware of. I'm happy to move the vector definition inside the loop.

4488

This loop is collecting in VectorLoopPredInsts the scalarized instructions we produced in scalarizeInstruction. It then predicates all the scalarized instructions for a given unroll part and vector lane. If an instruction is uniform-after-vectorization we only generate values for Lane zero during scalarization, and getScalarValue asserts if we try to get a value for a Lane > 0.

However, as I'm thinking about this, I don't think we can ever end up with an instruction that requires predication that is also marked uniform-after-vectorization. Unless I'm missing something, we should probably just remove this if condition. What do you think?

7099

We have to create the edge masks in program order because they are cached in MaskCache. PHI widening, for example, also calls createEdgeMask. During PHI widening, if a mask isn't already in the cache, we will produce it at that time. But then when we go back to perform the actual predication, we would reuse the masks we created for the PHIs. But these masks may not dominate the branches we create for the predicated blocks, so we have to produce the masks in order. The existing tests in if-pred-non-void.ll require this.

mkuper added inline comments.Nov 14 2016, 1:40 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
4488

However, as I'm thinking about this, I don't think we can ever end up with an instruction that requires predication that is also marked uniform-after-vectorization. Unless I'm missing something, we should probably just remove this if condition. What do you think?

Yes, that's why I was having trouble imagining it. Couldn't figure out what a uniform-after-vectorization predicated instruction looks like.

I think we should remove this, and have an assert somewhere to make sure it really doesn't happen. (If we find out that does happen, I guess it's some edge-case we're not thinking about, so I'm not sure this code would do the right thing anyway.)

mssimpso updated this revision to Diff 77885.Nov 14 2016, 2:10 PM
mssimpso marked 4 inline comments as done.

Addressed comments from Michael and Gil.

  • Updated comments
  • Added assert for uniform-after-vectorization
  • Moved vector definition inside loop
  • Made some auto types explicit
gilr added inline comments.Nov 14 2016, 3:00 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
7099

Ahhh ... right.
This seems somewhat unclean, though: the patch moves block mask creation logic from scalarization to predication (which makes sense), but must leave some of it behind for caching reasons.
So as long as we're generating masks in program order for caching reasons, could the original createBlockInMask call be retained if the same caching was added for block masks? This would make it clear that it's the block mask we need generated here, avoid partly-duplicating its code and make the masks caching behavior more consistent (more of a getOrCreate API).

mssimpso added inline comments.Nov 15 2016, 7:20 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
7099

Sure, we can do that. I'll add a cache for the block-in masks and update the patch. Thanks!

mssimpso updated this revision to Diff 78006.Nov 15 2016, 8:26 AM
mssimpso marked an inline comment as done.

Addressed Gil's comments.

  • Added caching for block-in masks, and restored call to createBlockInMask in scalarizeInstruction
  • Updated comments
mkuper edited edge metadata.Nov 15 2016, 1:19 PM

LGTM, but please also wait for Gil re the masking change.

gilr edited edge metadata.Nov 15 2016, 3:17 PM

So committing to a single predicated block may be too aggressive at this point, for example in the following where the srem is moved into the predicated block and above the vectorized sub feeding it:

void foo(int* a, int b, int* c, int* d) {
  for (int i = 0; i < 10000; ++i) {
    int x = 333;
    if (a[i] > 777) {
      int t1 = a[i] / c[i];
      int t2 = b - t1;
      x = t2 % d[i];
    }
    a[i] += x;
  }
}

I'm not sure it would still serve the needs of the smarter-scalarization work, but as a standalone improvement to predication logic we can still try to combine predicated instructions into a mutual basic block wherever possible (e.g. reuse the last predicated basic block if possible, create a new one otherwise).

...and ignore my LGTM, Gil's right. :-)

Ah, you're right, Gil. Thanks very much for pointing this out. It seems this work will not be as straightforward as I thought, since there will be times when splitting the original block is unavoidable. I think it probably makes sense to table this patch for the moment and perhaps come back to it after the scalarization work. I'll finish addressing all the comments over at D26083. Thanks!

mssimpso abandoned this revision.Dec 7 2016, 7:32 AM

I'm abandoning this patch since the original approach was not correct. A new approach that combines predicated instructions into the same basic block when possible will likely look different enough from this patch that we should start a new review.