This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorize] Query TTI when deciding to splat IV
AbandonedPublic

Authored by samparker on Aug 15 2016, 6:46 AM.

Details

Summary

Recently, after D20474, we observed regressions in our internal testing because of changes to induction variable calculations. This patch implements a TargetTransform hook to query the backend for whether splatting a scalar is expensive and returns true for ARM. This enables the vectorization of integer induction phi nodes instead of using a scalar and splatting it for each loop iteration.

Diff Detail

Event Timeline

samparker updated this revision to Diff 68032.Aug 15 2016, 6:46 AM
samparker retitled this revision from to [LoopVectorize] Query TTI when deciding to splat IV.
samparker updated this object.

Ping.

Would someone be able to review this please?

anemet added a reviewer: wmi.Aug 18 2016, 9:15 AM
anemet added subscribers: wmi, anemet.

Adding @wmi, who made the original change.

wmi added a reviewer: mkuper.Aug 19 2016, 9:27 AM
wmi edited edge metadata.

+michael, he is the author of the code.

Sam, could you add a testcase for the regression? I am curious why such regression is arm specific.

mkuper added a subscriber: mssimpso.

Adding @mssimpso, who improved the logic significantly after my original patch. :-)

And yes, as @wmi said, this needs a regression test. Also, could you explain why this is an ARM-specific issue? I want to make sure we're not talking about a case where we should never be using a scalar IV, regardless of platform.

samparker updated this revision to Diff 68847.Aug 22 2016, 5:45 AM
samparker edited edge metadata.

Added a regression test

Hi Michael,

The splatting of the indvar seems to be problematic for ARM because a VDUP instruction gets generated, which has a long latency,

mssimpso edited edge metadata.Aug 22 2016, 6:37 AM

Hi Everyone,

I'm planning to take a look at this case today. I'll update the thread in a few hours.

Matt.

Hi Sam,

I don't think TTI is the right way to fix this. The underlying issue in the test case is that we're vectorizing instructions marked uniform after vectorization. We should be scalarizing them instead.

This is what happens. We first collect the loop uniforms. The GEPs are marked uniform, as well as the arithmetic producing their indices. Because the IVs have no non-uniform users, they too are marked uniform, which all seems correct to me.

Then we vectorize the IVs. When vectorizing them, we know that we should only need scalar versions (Legal->isScalarAfterVectorization is true because the IVs are uniform). However, to keep WidenMap happy (see D23169) we actually vectorize them with the splat method as well. This is not ideal but also not the underlying issue. Because all users of the IVs should also be scalar, it's OK (but not good) that we vectorize them with the splat method since the vector versions would have no users and would be deleted later on in the clean-up phase.

Of course, the problem with this is that we actually still are vectorizing the uniform users of the IVs, creating uses of the vector IVs that we later can't remove. We shouldn't be doing this. I think the right fix is to have something like the following in each case in vectorizeBlockInLoop (except for PHIs and branches):

if (Legal->isScalarAfterVectorization(&I)) {
  scalarizeInstruction(&I);
  continue;
}

If we've already decided that an instruciton should remain scalar, we shouldn't vectorize it. Does this make sense? We will probably want to wait to do this until after D23169 has been committed to avoid large increases in compile-time. I tested the above approach (without D23169) on your test case and the ugly vector IVs were removed. Would you mind doing the same to see if your performance is restored?

Matt.

Can we please also have this loop explained at a higher level? The IR in the testcase is not really digestible.

Of course, the problem with this is that we actually still are vectorizing the uniform users of the IVs, creating uses of the vector IVs that we later can't remove. We shouldn't be doing this. I think the right fix is to have something like the following in each case in vectorizeBlockInLoop (except for PHIs and branches):

if (Legal->isScalarAfterVectorization(&I)) {
  scalarizeInstruction(&I);
  continue;
}

If we've already decided that an instruciton should remain scalar, we shouldn't vectorize it. Does this make sense?

It's pretty odd that we don't already do this.
I assume the reason we don't already get a lot of garbage is because we have LICM before LV?

Can we please also have this loop explained at a higher level? The IR in the testcase is not really digestible.

I can reproduce this with something like:

void foo(int *p, int *q, short k) {
  for (short i = 0; i < 256; ++i)
    p[i + k] = q[i + k - 1] * 5 + 1;
}

If we only have, say, "i + k" as the IV user, instcombine can clean this up, but it seems like it can't handle more complex cases. Haven't dug into why.

Hi Matt,

I can confirm that your approach works for me :)

Hi Matt,

I can confirm that your approach works for me :)

Great! I hope you don't mind that a took a stab at implementing this approach over in D23889. We should now be able to scalarize instructions that LVL marks scalar after vectorization without increasing code size pre-instcombine. To do this we can check within the scalarization logic if an instruction is also uniform, and if so, only generate one scalar value per unroll iteration instead of VF values per iteration.

samparker abandoned this revision.Aug 26 2016, 1:03 AM